Abstract. We apply the theory of Markov random fields on trees to derive
a phase transition in the number of samples needed in order to reconstruct
phylogenies.

We consider the Cavender-Farris-Neyman model of evolution on trees, where
all the inner nodes have degree at least 3, and the net transition on each edge
is bounded by $\epsilon$. Motivated by a conjecture by M. Steel, we show that if
$2(1 - 2\epsilon)^2 > 1$, then for balanced trees, the topology of the underlying tree,
having $n$ leaves, can be reconstructed from $O(\log n)$ samples (characters) at
the leaves. On the other hand, we show that if $2(1 - 2\epsilon)^2 < 1$, then there exist
topologies which require at least $n^{\Omega(1)}$ samples for reconstruction.

Our results are the first rigorous results to establish the role of phase
transitions for Markov random fields on trees, as studied in probability, statistical
physics and information theory, in the study of phylogenies in mathematical
biology.



1. Introduction 

Phylogenetic trees commonly model evolution of species. In this paper we study 
reconstruction of phylogenetic trees from samples (characters) of data at the leaves 
of the tree, and the relationship between this problem and the theory of Markov
random fields on trees. 

We apply tools from the theory of the Ising model on trees to derive a phase transition
in the number of samples needed in order to reconstruct the topology of a tree for 
the Cavender-Farris-Neyman model of evolution. 

Cavender, Farris and Neyman |2H1 lEl introduced a model of evolution of 
binary characters. In this model, the evolution of characters is governed by the 
Ising model on the tree of species. It is assumed that characters evolve identically 
and independently. In statistical physics, the study of the Ising model on trees, 
which dates back to the first half of the 20th century, focused mostly on regular
trees (named "Bethe lattice", "homogeneous trees" or "Cayley trees" in statistical
physics), and only more recently on general trees |22l2n]. However, the problem 
of reconstructing a tree from samples of data at the leaves was not studied in this 
context. 

The threshold for the extremality of the free measure for the Ising model on 
the tree will play a crucial role for phylogenies. The study of this threshold was 
initiated in [SOI El- The exact threshold for regular trees was found in 0], see also 

1.1. Definitions. We begin by defining the evolution process. For a tree
$T = (V,E)$ rooted at $\rho \in V$ we direct all edges away from the root, so that edge $e$ is
written as $e = (v,u)$, where $v$ is on the unique path connecting $\rho$ to $u$.

Let $\mathcal{A}$ be a finite set representing the values of some genetic characteristic.
Illustrative examples are $\mathcal{A} = \{A, C, G, T\}$, or $\mathcal{A} = \{\text{20 amino acids}\}$. We will often
refer to the elements of $\mathcal{A}$ as colors.

The propagation of the genetic character $\sigma$ from $\rho$ to the nodes of the tree $T$
is modeled in the following manner. The root color is chosen according to some
initial distribution $\pi$, so that $P[\sigma_\rho = i] = \pi_i$. The mutation along edge $e$ is
encoded by a stochastic matrix $(M^e_{i,j})_{i,j \in \mathcal{A}}$. For edge $e = (v,u)$, it holds that
$P[\sigma_u = j \mid \sigma_v = i] = M^e_{i,j}$. Moreover, if $\mathrm{path}(u,v) = \{u = v_0, \ldots, v_l = v\}$ is
the path from $u$ to $v$ in $T$, and $A(v) = \{w : v \notin \mathrm{path}(\rho, w)\}$, then it is assumed
that $(\sigma_v)$ satisfies the following Markov property (see e.g. |1HI I21| for the general
definition of Markov random field):

One of the fundamental problems of mathematical and computational biology 
is the reconstruction of phylogenetic trees. Our model of evolution is defined on 
a rooted tree. However, we will only consider the reconstruction of the un-rooted 
tree and ignore the problem of reconstructing the root. Indeed, in many cases it 
is impossible to reconstruct the root given the data at the leaves. For example, if 
all the $M^e$ are reversible with respect to the same distribution $\pi$ (which is also the
initial distribution), then without additional data or assumptions on the model, it 
is impossible to distinguish a root. 

For a tree $T = (V,E)$, we call $v \in V$ a leaf if $v$ has degree 1 in the graph $(V,E)$.
We write $\partial T$ for the boundary of the tree, i.e., the set of all leaves of $T$.

Let $T = (V,E)$ be a tree on $n$ leaves. Consider the evolution process on $T$,
where we consider $T$ as a tree rooted at $\rho \in V$, where $\rho$ is not one of the leaves
of $T$; let $(M^e)_{e \in E}$ be the collection of mutation matrices and $(\pi_i)_{i \in \mathcal{A}}$ the initial
distribution at $\rho$. For a coloring $\sigma$ on the vertices of the tree, denote by $\sigma_{\partial T}$
the values of the color at the boundary of the tree. Suppose that $k$ independent
samples of the above process, $(\sigma^t_v)_{1 \le t \le k;\, v \in T}$, are given. In biology it is common to
call these samples characters; we refer to these either as samples or as characters.
The objective is to find $T$ given the samples at the leaves $(\sigma^t_{\partial T})_{t=1}^{k}$.

The standard assumption in phylogeny is that all the internal degrees in $T$ are
3; it is also assumed that all rooted trees are rooted at internal vertices, see, e.g.,
[22]. We will slightly relax the first assumption.

Assumption 1.1. We assume that the evolution process is defined on rooted trees
$T$ such that all internal degrees of $T$ are at least 3, i.e., for all $v \in T$, either $\deg(v) \ge 3$
or $\deg(v) = 1$. It is also assumed that the root $\rho$ is not a leaf, i.e., $\deg(\rho) \ge 3$.

We give the following two equivalent formal definitions of "topology".

Definition 1.1. • Let $n$ be a positive integer and

- $T$ be a tree with labeled leaves $v_1, \ldots, v_n$, so that $v_i$ is labeled by $i$.

- $T'$ be a tree with labeled leaves $v'_1, \ldots, v'_n$, so that $v'_i$ is labeled by $i$.
We say that trees $T$ and $T'$ have the same topology if there exists a graph
isomorphism $\varphi : T \to T'$ such that $\varphi(v_i) = v'_i$ for $i = 1, \ldots, n$.
• Equivalently, the topology of $T$ is determined by the pairwise distances
$(d(v_i, v_j))_{i,j=1}^{n}$, where $d$ is the graph-metric distance.

• For a tree $T = (V,E)$ with labeled leaves $v_1, \ldots, v_n$ we write $\mathrm{top}(T)$ for
the topology of $T$, i.e. the equivalence class of $T$ under the relation of same
topology (or $\mathrm{top}(T)$ is the array of pairwise distances $(d(v_i, v_j))_{i,j=1}^{n}$). The
topology of a tree $T$ rooted at $\rho$ is the topology of the un-rooted tree $T$.

Naturally, it is impossible to reconstruct the topology with probability 1. 

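Definition 1.2. Let $n$ be a positive integer and

• $\mathcal{T}$ be a family of rooted trees on $n$ labeled leaves.

• $\mathcal{M}$ be a set of $|\mathcal{A}| \times |\mathcal{A}|$ stochastic matrices.
We write $\mathcal{T} \otimes \mathcal{M}$ for

$\mathcal{T} \otimes \mathcal{M} = \{(T, (M^e)_{e \in E}) : T \in \mathcal{T} \text{ and } \forall e \in E(T),\ M^e \in \mathcal{M}\}$.

We say that it is possible to reconstruct the topology from $k$ samples with probability
$1 - \delta$, if there exists a map $\psi : (\mathcal{A}^n)^k \to \mathrm{top}(\mathcal{T}) = \{\mathrm{top}(T) : T \in \mathcal{T}\}$, such that for
all $(T, (M^e)_e) \in \mathcal{T} \otimes \mathcal{M}$, if $(\sigma^t_{\partial T})_{t=1}^{k}$ are $k$ independent samples at the leaves, then

$P\left[\psi\left((\sigma^t_{\partial T})_{t=1}^{k}\right) = \mathrm{top}(T)\right] \ge 1 - \delta$.

In this case we say that $\psi$ reconstructs the topology for $\mathcal{T} \otimes \mathcal{M}$ (from $k$ samples
with probability $1 - \delta$). See (1) for a diagram representing $\psi$.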

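(1) [Diagram relating $\mathcal{T} \otimes \mathcal{M}$, $\mathrm{top}(\mathcal{T})$, the samples $(\sigma^t)_{t=1}^{k}$, their restriction $(\sigma^t_{\partial T})_{t=1}^{k}$ to the leaves, and the map $\psi$.]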

Note that this is a strong definition of reconstruction. In particular, if $\psi$ satisfies
Definition 1.2 then for any distribution of trees and matrices which is supported
on $\mathcal{T} \otimes \mathcal{M}$, $\psi$ reconstructs the underlying topology with probability at least $1 - \delta$.
Also note that in applications it is desirable that $\psi$ have a simple algorithmic
implementation.

1.2. A Conjecture. Our results are motivated by the following fundamental
conjecture.

Conjecture 1.2. Assume that the mutation matrices have a single order
parameter $\theta$, and let $\theta(e)$ be the order parameter for the mutation matrix of edge
$e$. Consider a Markov random field on the $(b+1)$-regular tree where the mutation
matrices on all edges have the same parameter $\theta$. Suppose that there exists $\theta_c$ such
that

• If $\theta > \theta_c$, then the Markov random field is in an ordered phase (in some
technical sense, see below),

• If $\theta < \theta_c$, then the Markov random field is in an unordered phase.

We conjecture that the minimal number of samples needed in order to reconstruct
phylogenies for the family of all trees on $n$ leaves, where all internal degrees are at
least $b+1$, is

• $k = (c(\theta) + o(1)) \log n$, if for all edges of the phylogenetic tree, $\theta(e) \ge \theta > \theta_c$.

• $k = n^{c(\theta) + o(1)}$, if for all edges of the phylogenetic tree, $\theta(e) \le \theta < \theta_c$.
In the above conjecture the desired reconstruction probability is $1 - \delta$, for some
fixed $0 < \delta < 1$. A more formal conjecture will require specifying how "order" is
measured. One possibility is to call a measure ordered for a specific value of $\theta$, if
the free measure on the infinite tree is extremal (see Subsection 1.5). Another is
to look at spectral parameters of the mutation matrices, and let $\theta(M) = |\lambda_2(M)|$,
where $|\lambda_2(M)|$ is the second largest eigenvalue of $M$ in absolute value. In this
case it is natural to define $\theta_c$ by $b\theta_c^2 = 1$, see 1501, EEl El El. Following the results
reported here, further support for Conjecture 1.2 was found in [25, 27].

1.3. The CFN model. Below we focus on the model where $\mathcal{M}$ consists of all $2 \times 2$
matrices of the form

(2) $M^e = \begin{pmatrix} 1 - \epsilon(e) & \epsilon(e) \\ \epsilon(e) & 1 - \epsilon(e) \end{pmatrix}$,

where $0 < \epsilon(e) < 1/2$ for all $e$. We find it useful to denote $\theta(e) = 1 - 2\epsilon(e)$. Without
loss of generality, we name the two colors $-1$ and $1$. This model is referred to as
the Cavender-Farris-Neyman (CFN) model. It was studied in |28l 1111 IS], where it is
shown that if for all $e$ it holds that $\epsilon < \theta(e) < 1 - \epsilon$, then the underlying topology
can be reconstructed with probability $1 - \delta$ using $k = \mathrm{poly}_{\delta,\epsilon}(n)$ samples. In |,S1)
this result is generalized to mutation processes on any number of colors, provided
that the $M^e$ satisfy $\det(M^e) \notin [-1, -1 + \epsilon] \cup [-\epsilon, \epsilon] \cup [1 - \epsilon, 1]$ for all $e$. The dependency
of $k$ on $\delta$ and $\epsilon$ is not stated explicitly in these results.
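
The parametrization $\theta(e) = 1 - 2\epsilon(e)$ is convenient because $\theta$ is multiplicative along paths: the net transition over two consecutive edges is again a symmetric $2 \times 2$ matrix, with parameter $\theta(e_1)\theta(e_2)$. The following sketch checks this with exact rational arithmetic. It assumes the symmetric form of the mutation matrix with flip probability $\epsilon(e)$; the helper names `cfn_matrix`, `theta` and `compose` are illustrative and not part of the paper.

```python
from fractions import Fraction

def cfn_matrix(eps):
    """Symmetric 2x2 mutation matrix with flip probability eps (assumed form)."""
    return [[1 - eps, eps], [eps, 1 - eps]]

def theta(matrix):
    """theta = 1 - 2*eps, read off the off-diagonal entry."""
    return 1 - 2 * matrix[0][1]

def compose(m1, m2):
    """Matrix product: the net transition along two consecutive edges."""
    return [[sum(m1[i][k] * m2[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

m1 = cfn_matrix(Fraction(1, 10))   # theta = 4/5
m2 = cfn_matrix(Fraction(1, 5))    # theta = 3/5
path = compose(m1, m2)             # again symmetric, with theta = 12/25
```

In particular, the correlation between two nodes joined by a path depends only on the product of the $\theta(e)$ along that path.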

It is desirable to minimize the number of samples needed for reconstruction.
Since the number of trees with $n$ leaves is exponential in $\Theta(n \log n)$, and each
sample consists of $n$ bits, it is clear that $\Omega(\log n)$ is a lower bound for the number
of samples.

In IHIIHI it is shown that for the CFN model (2), if for all $e$, $1 > \theta_{\max} \ge \theta(e) \ge
\theta_{\min} > 0$, then it is possible to reconstruct the tree $T$ with probability $1 - \delta$, if

where $d(T) = \Theta(\text{depth of } T)$ and $c = c(\delta)$.

For many of the trees that occur naturally in the reconstruction setting, the
depth of the tree is $\Theta(\log n)$. Bound (3) on $k$ is therefore $k = n^{\Theta(1)}$, which doesn't
improve previous bounds. On the other hand, looking at families of random trees,
$d(T)$ is typically $O(\log \log n)$, and therefore by (3) a $k = \mathrm{polylog}(n)$ number of
samples suffices for reconstruction of a typical member of these families.

1.4. Phase transition for the CFN model. This paper is motivated by the 
following problem: When is the number of samples needed in order to reconstruct 
the topology of T polynomial in n and when is it poly-logarithmic in n? The hardest 
case in the analysis of jHUHl is that of a balanced tree. We will focus on balanced 
trees below. 

Definition 1.3. A tree $T$ rooted at $\rho$ is balanced, if all the leaves of $T$ have the
same distance to $\rho$, i.e., there exists an $r$ such that

$\partial T = \{v \in V(T) : d(v, \rho) = r\}$.

We first focus on the case where the mutation rate is the same for all edges.
The following two theorems already indicate the importance of certain
"phase-transitions" for the problem. For $b \ge 2$, we let $T_b((b+1)b^q)$ denote the space of
all balanced rooted trees on $n = (b+1)b^q$ leaves, where all the internal degrees are
exactly $b+1$. We call a tree in $T_b((b+1)b^q)$ a $(q+1)$-level $(b+1)$-regular tree.

Theorem 1.3. Consider the tree reconstruction problem for the CFN model on the
space $T_b(n)$, where $n = (b+1)b^q$, and for all $e$, $\theta(e) = \theta$ is independent of $e$. If
$b\theta^2 > 1$, then there exists $c_\theta < \infty$ such that for all $\delta > 0$, it is possible to reconstruct
the topology from $k$ samples with probability $1 - \delta$, where $k = c_\theta(\log n - \log \delta)$.

This result cannot be extended to $\theta$ such that $b\theta^2 < 1$, as the following theorem
implies that reconstructing balanced trees actually requires a polynomial number of
samples when the mutation rate is high.

Theorem 1.4. Suppose that $b\theta^2 < 1$. Then there exists $q_0$ such that for all $q > q_0$,
the tree reconstruction problem for the CFN model on the space $T_b(n)$, where
$n = (b+1)b^q$ and for all $e$, $\theta(e) = \theta$, satisfies the following.

Given a uniformly chosen tree from $T_b((b+1)b^q)$ (assume that the initial
distribution of the color at the root is uniform, $\pm 1$ with probability $1/2$ each) and $k$
samples of the coloring at the boundary of the tree, the probability of reconstructing
the topology is at most



Theorems 1.3 and 1.4 indicate the importance of the study of the phase
transitions for the Ising model on trees, where an interesting phase transition occurs
when $b\theta^2 = 1$, see Subsection 1.5 for more background.

Later we generalize Theorem 1.3 to the standard model on balanced trees. Let
$T_{\ge b}(n)$ be the space of all balanced rooted trees on $n$ leaves, where all the internal
degrees are at least $b+1$.

Theorem 1.5. Consider the tree reconstruction problem for the CFN model on the
space $T_{\ge b}(n)$. Suppose that $\theta_{\min}$ satisfies $b\theta_{\min}^2 > 1$, and that all edges $e$ satisfy
$\theta_{\min} \le \theta(e) \le \theta_{\max} < 1$. Then there exists a constant $c = c(\theta_{\min}, \theta_{\max}) =
c(\theta_{\min})/(1 - \theta_{\max})^2 < \infty$, such that for all $\delta > 0$, it is possible to reconstruct the
topology with probability $1 - \delta$ from $k = c(\log n - \log \delta)$ samples in
$\mathrm{poly}_{\delta, \theta_{\min}, \theta_{\max}}(n)$ time.



Theorem 1.5 implies in particular a conjecture of Steel [23] for balanced trees,
which initiated this work. We believe that Theorem 1.5 could play an important
role in proving the analogous result for general (non-balanced) trees. We can also
prove an upper bound for the number of samples when $b\theta_{\min}^2 < 1$.

Theorem 1.6. Consider the tree reconstruction problem for the CFN model on the
space $T_{\ge b}(n)$, and let $q = \log_b n$. Suppose $\theta_{\max} < 1$, $\theta_{\min}$
satisfies $g^2 < b\theta_{\min}^2 < 1$,
and all edges $e$ satisfy $\theta_{\min} \le \theta(e) \le \theta_{\max}$. Then for all $\delta > 0$, it is possible
to reconstruct the topology with probability $1 - \delta$ given $k = c(\theta_{\min}, \theta_{\max}, \delta, g)\, g^{-4q}$
samples in $\mathrm{poly}_{\delta, \theta_{\min}, \theta_{\max}}(n)$ time.

Theorem 1.4 also implies a lower bound on learning the tree in the PAC setting,
see El 

1.5. Phase transitions for the Ising model on the tree. The extremality of
the free measure for the Ising model on the regular tree plays a crucial role in this
paper. The study of the extremality of the free measure begins with iSOmni. Later
papers include ^EBEIES] (see ^O] for more detailed background).

Consider the CFN model on a $q$-level $(b+1)$-regular tree, where $\theta(e) = \theta$ for
all $e$, and where the root is chosen to be each of the two colors with probability
$1/2$. This measure is known in statistical physics as the free Gibbs measure for the
Ising model on the homogeneous tree (or Bethe lattice). Note in particular that
the tree topology is fixed in advance. Given a single sample of the colors at the
leaves of the tree, we want to reconstruct some information on the root color. The
basic question is whether the amount of information that can be reconstructed decays
to 0 as $q$ increases. Let $\sigma_\rho$ denote the color of the root, and $\sigma_q$ denote the colors
at level $q$. It turns out that the following conditions are equivalent.

• $I(\sigma_\rho, \sigma_q) \to 0$ as $q \to \infty$, where $I$ is the mutual information operator.

• The total variation distance between the distribution of $\sigma_q$ given $\sigma_\rho = 1$,
and the distribution of $\sigma_q$ given $\sigma_\rho = -1$ decays to 0 as $q \to \infty$.

• For all algorithms, the probability of reconstructing $\sigma_\rho$ from $\sigma_q$ decays to
$1/2$ as $q \to \infty$.

• The free Gibbs measure for the Ising model on the infinite $(b+1)$-regular
tree is extremal.

(see [20] for definitions and proof of the equivalence, and for this equivalence
for general Markov random fields on the tree).

The extremality phase transition may be formulated as follows. When $b\theta^2 > 1$,
some information on the root can be reconstructed independently of the height of
the tree, i.e., none of the equivalent conditions above hold. When $b\theta^2 < 1$, all of the
above conditions hold, and it is therefore impossible to reconstruct the root color
as $q \to \infty$.
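
As a small numerical illustration of the threshold $b\theta^2 = 1$, the sketch below classifies a pair $(b, \theta)$ and converts the threshold into a critical flip probability via $\theta = 1 - 2\epsilon$. The function names and the returned labels are illustrative choices made here, not notation from the paper.

```python
import math

def regime(b, theta):
    """Classify (b, theta) relative to the threshold b * theta^2 = 1."""
    x = b * theta ** 2
    if x > 1:
        return "root reconstructible"
    if x < 1:
        return "free measure extremal"
    return "critical"

def critical_epsilon(b):
    """Flip probability eps_c solving b * (1 - 2*eps_c)^2 = 1."""
    return (1 - 1 / math.sqrt(b)) / 2
```

For $b = 2$ this gives a critical flip probability of roughly $0.146$, which is the condition $2(1 - 2\epsilon)^2 = 1$ of the abstract.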

Theorem 1.3 is based on algorithmic aspects of this phase transition discussed
in [23], while Theorem 1.4 utilizes information bounds from [10].

1.6. Paper outline. In Section 2 we give a short proof of Theorem 1.3, as it
demonstrates some of the key ideas to be applied later in Theorem 1.5 and Theorem 1.6.
In Section 3 we prove Theorem 1.4. The proof uses information bounds and is
somewhat independent from the other sections. In Section 4 we study in detail
the behavior of majority algorithms for local reconstruction as the main technical
ingredient to be used later. Section 5 contains some basic results regarding large
deviations and four-point conditions. In Section 6 we present the proofs of Theorems
1.5 and 1.6.

2. Logarithmic reconstruction for fixed $\theta$

We start by proving Theorem 1.3. We first define formally the function Maj.
Note that when the number of inputs is even, this function is randomized.

Definition 2.1. Let $\mathrm{Maj} : \{-1, 1\}^d \to \{-1, 1\}$ be defined as:

$\mathrm{Maj}(x_1, \ldots, x_d) = \mathrm{sign}\left(\sum_{i=1}^{d} x_i + 0.5\,\omega\right)$,

where $\omega$ is an unbiased $\pm 1$ variable which is independent of the $x_i$. Thus when $d$ is
odd,

$\mathrm{Maj}(x_1, \ldots, x_d) = \mathrm{sign}\left(\sum_{i=1}^{d} x_i\right)$.

When $d$ is even,

$\mathrm{Maj}(x_1, \ldots, x_d) = \mathrm{sign}\left(\sum_{i=1}^{d} x_i\right)$,

unless $\sum_{i=1}^{d} x_i = 0$, in which case $\mathrm{Maj}(x_1, \ldots, x_d)$ is chosen to be $\pm 1$ with
probability $1/2$.
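
A minimal sketch of this randomized majority function follows. The name `maj` and the optional `rng` argument are illustrative; ties, which can only occur for an even number of inputs, are broken by a fair coin as described above.

```python
import random

def maj(xs, rng=random):
    """Majority of +/-1 inputs; a tie is resolved by an unbiased +/-1 coin."""
    s = sum(xs)
    if s != 0:
        return 1 if s > 0 else -1
    # Only reachable when the number of inputs is even and the sum vanishes.
    return rng.choice((-1, 1))
```

For an odd number of inputs the function is deterministic, so the random source is never consulted.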

For $b \ge 2$, we call a tree $T$ rooted at $\rho$ the $\ell$-level $b$-ary tree, if all internal nodes
have exactly $b$ descendants and all the leaves are at distance $\ell$ from the root.

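Lemma 2.1. Let $b$ and $\theta$ be such that $b\theta^2 > 1$. Then there exist $\ell = \ell(\theta)$, and
$1 > \eta_0 = \eta_0(b, \theta) > 0$, such that for all $\eta \ge \eta_0$, the CFN model on the $\ell$-level $b$-ary
tree $T$ with

• $\theta(e) = \theta$ for all $e$ which is not adjacent to $\partial T$, and

• $\theta(e) = \theta\eta$ for all $e$ which is adjacent to $\partial T$,
satisfies

$E[+\mathrm{Maj}(\sigma_{\partial T}) \mid \sigma_\rho = +1] = E[-\mathrm{Maj}(\sigma_{\partial T}) \mid \sigma_\rho = -1] \ge \eta_0$.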

This follows from Section 3 of [23]. Lemma 2.1 also follows from the more general
Theorem 4.1 below.

The only other tool needed for the proof in this case are standard large deviations 
results, see, e.g., P Corollary A. 1.7]. 

Lemma 2.2. Let $S = \sum_{i=1}^{n} X_i$, where $X_i$ are i.i.d. $\{-1, 1\}$ random variables.
Then for all $a > 0$,

$P\left[|S - E[S]| > a\right] < 2\exp\left(-\frac{a^2}{2n}\right)$.

Definition 2.2. Let $T$ be a balanced tree.

• The $\ell$-topology of $T$ is the function $d^*_\ell : \partial T \times \partial T \to \{0, \ldots, 2\ell + 2\}$, defined
by $d^*_\ell(u, v) = \min\{d(u, v), 2\ell + 2\}$.

• We let $L_{q-i} = \{v \in T : d(v, \partial T) = i\}$.

• The $\ell$-labeling of $T$ is the labeling of $\bigcup_{i=0}^{\ell} L_{q-i}$, where $v \in L_{q-i}$ is labeled
by

$\partial T(v) = \{w \in \partial T : d(v, w) = i\}$.

Note that for a balanced tree $T$, the $\ell$-topology of $T$ determines the $\ell$-labeling of $T$:
for $i \le \ell$ the labels of $L_{q-i}$ are given by the sets $\{\{w' \in \partial T : d^*_\ell(w, w') \le 2i\} : w \in \partial T\}$.
Moreover, if $u, v \in V$, $d(u, \partial T) \le \ell$ and $d(v, \partial T) \le \ell$, then $v$ is a descendant of $u$
iff $\partial T(v) \subset \partial T(u)$.
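
The passage from the truncated distances to the labeling can be sketched as follows. The example assumes the leaf-to-leaf distances are given as a square matrix indexed by leaves $0, \ldots, n-1$; the function names and the balanced binary test tree, whose leaf distance is twice the bit length of the XOR of the two indices, are illustrative choices.

```python
def truncated_topology(dist, ell):
    """d*_ell(u, v) = min(d(u, v), 2*ell + 2) for every pair of leaves."""
    cap = 2 * ell + 2
    return [[min(d, cap) for d in row] for row in dist]

def level_labels(dstar, i):
    """Labels of the nodes at distance i from the leaves: the distinct sets
    {w' : d*(w, w') <= 2*i}, taken over all leaves w."""
    n = len(dstar)
    return {frozenset(w2 for w2 in range(n) if dstar[w][w2] <= 2 * i)
            for w in range(n)}

# Balanced binary tree on 8 leaves: d(u, v) = 2 * (height of the common ancestor).
dist = [[2 * (u ^ v).bit_length() for v in range(8)] for u in range(8)]
dstar = truncated_topology(dist, 1)
labels = level_labels(dstar, 1)
```

With $\ell = 1$ the truncated distances no longer separate leaves at distance 4 from leaves at distance 6, but they still identify the sibling pairs, which is all the 1-labeling requires.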

The core of the proof of Theorem 1.3 is the following lemma.

Lemma 2.3. Let $b$ and $\theta$ be such that $b\theta^2 > 1$. Let $\ell$ and $\eta_0$ be such that Lemma
2.1 holds, and assume that $\eta \ge \eta_0$. Consider the CFN model on the family of
balanced trees of $q$ levels, where all internal nodes have at least $b$ children and the
total number of leaves is $n$. Assume that $\theta : E \to [0, 1]$ satisfies

• $\theta(e) = \theta$ for all $e$ which is not adjacent to $\partial T$, and
• $\theta(e) = \theta\eta$ for all $e$ which is adjacent to $\partial T$.

Then

• given $k$ independent samples of the process at the leaves of $T$, $(\sigma^t_{\partial T})_{t=1}^{k}$,
and $\ell < q$, it is possible to recover the $\ell$-topology of $T$ with error probability
bounded by

(5) $n^2 \exp(-c^* k)$,

where $c^* = \eta_0^4 \theta^{4\ell} (1 - \theta^2)^2 / 8$.

• For all $T$ and $i \ge 0$, there exists a map $\Psi = \Psi^i_T : \{\pm 1\}^{\partial T} \to \{\pm 1\}^{L_{q - i\ell}}$ for
which the following hold.

If $\sigma$ is distributed according to the CFN model on $T$, and $\sigma' = \Psi(\sigma_{\partial T})$,
then $(\sigma'_v)_{v \in L_{q - i\ell}} = (\sigma_v \tau_v)_{v \in L_{q - i\ell}}$, where $\tau_v$ are i.i.d. variables. Moreover,
$\tau_v$ are independent of $(\sigma_v)_{d(v, \partial T) \ge i\ell}$ and satisfy $E[\tau_v] \ge \eta_0$.

The map $\Psi$ may be constructed from the $(i\ell)$-topology of $T$. In particular,
if $T_1$ and $T_2$ have the same $(i\ell)$-topology, then $\Psi_{T_1} = \Psi_{T_2}$.
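Proof. Let $c(u, v)$ be the correlation between $u$ and $v$:

$c(u, v) = \frac{1}{k} \sum_{t=1}^{k} \sigma^t_u \sigma^t_v$.

Suppose that $d(u, v) = 2r$. Then $E[c(u, v)] = a_r$, where $a_r = \eta^2 \theta^{2r}$. We let

$I_r = \begin{cases} \left(\frac{a_r + a_{r+1}}{2}, \frac{a_r + a_{r-1}}{2}\right) & \text{if } 1 \le r \le \ell, \\ \left[-1, \frac{a_\ell + a_{\ell+1}}{2}\right] & \text{if } r = \ell + 1. \end{cases}$

Since $k\,c(u, v)$ is a sum of $k$ i.i.d. $\pm 1$ variables, it follows from Lemma 2.2 that for
all $u$ and $v$,

$P\left[c(u, v) \notin I_{d^*_\ell(u, v)/2}\right] \le \max_{1 \le r \le \ell} 2\exp\left(-\frac{(a_r - a_{r+1})^2}{8} k\right)
= 2\exp\left(-k \eta^4 \theta^{4\ell} (1 - \theta^2)^2 / 8\right) \le 2\exp\left(-k \eta_0^4 \theta^{4\ell} (1 - \theta^2)^2 / 8\right)$.

Note that the intervals $(I_r)_{r=1}^{\ell+1}$ are disjoint. Define $D(u, v) = 2r$ if $c(u, v) \in I_r$.
Then $D(u, v) = d^*_\ell(u, v)$ for all $u$ and $v$, with error probability bounded by (5),
thus proving the first claim of the lemma.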
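
The first claim is effectively a decoding rule: compute the empirical correlation of two leaves and report the distance whose expected correlation $a_r = \eta^2 \theta^{2r}$ lies closest, capped at $2\ell + 2$. A minimal sketch follows, using the midpoints between consecutive $a_r$ as cut points. The function names are illustrative, and treating any correlation above the first cut point as distance 2 is a simplification made here.

```python
def correlation(samples_u, samples_v):
    """Empirical correlation c(u, v) = (1/k) * sum_t sigma_u^t * sigma_v^t."""
    k = len(samples_u)
    return sum(x * y for x, y in zip(samples_u, samples_v)) / k

def expected_correlation(r, theta, eta):
    """a_r = eta^2 * theta^(2r), the mean correlation at distance 2r."""
    return eta ** 2 * theta ** (2 * r)

def estimate_distance(c, theta, eta, ell):
    """Return 2r for the first r in 1..ell whose lower cut point c exceeds;
    everything below the last cut point is reported as 2*ell + 2."""
    for r in range(1, ell + 1):
        lower = (expected_correlation(r, theta, eta)
                 + expected_correlation(r + 1, theta, eta)) / 2
        if c > lower:
            return 2 * r
    return 2 * ell + 2
```

The estimate is correct whenever the empirical correlation deviates from its mean by less than half the gap between consecutive $a_r$, which is the event controlled by Lemma 2.2.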

We now prove the second claim by induction on $i$. The claim is trivial for $i = 0$
as we may take $\Psi$ to be the identity map.

For the induction step, suppose that we are given $d^*_{(i+1)\ell}$. Label all the vertices
$v \in \bigcup_{j=0}^{(i+1)\ell} L_{q-j}$ by the $(i+1)\ell$-labeling of $T$. By the induction hypothesis, there exists
a map $\Psi'$ such that $\Psi'(\sigma_{\partial T}) = (\sigma_v \tau_v)_{v \in L_{q - i\ell}}$, where $\tau_v$ are i.i.d. $\pm 1$ variables.
Moreover, $\tau_v$ are independent of $(\sigma_v)_{d(v, \partial T) \ge i\ell}$ and satisfy $E[\tau_v] \ge \eta_0$.

By the properties of the labeling, for each $w \in L_{q - (i+1)\ell}$, there exists a set
$R(w) \subset L_{q - i\ell}$, which is the set of leaves of an $\ell$-level $b$-ary tree rooted at $w$.

We now let $\Psi(\sigma_{\partial T}) = (\sigma'_w)_{w \in L_{q - i\ell - \ell}}$, where

$\sigma'_w = \mathrm{Maj}\left((\Psi'(\sigma_{\partial T}))_v : v \in R(w)\right) = \mathrm{Maj}(\sigma_v \tau_v : v \in R(w))$.

By Lemma 2.1 it follows that $(\sigma'_w)_{w \in L_{q - i\ell - \ell}} = (\sigma_w \tau_w)_{w \in L_{q - i\ell - \ell}}$, where $\tau_w$ are
i.i.d. $\pm 1$ variables. Moreover, $\tau_w$ are independent of $(\sigma_v)_{d(v, \partial T) \ge i\ell + \ell}$ and satisfy
$E[\tau_w] \ge \eta_0$, proving the second claim. ■



PHASE TRANSITIONS IN PHYLOGENY 






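Proof of Theorem 1.3. Let $b$ and $\theta$ be such that $b\theta^2 > 1$. Let $\ell$ and $\eta_0$ be
such that Lemma 2.1 holds. Note that if $i\ell \ge q$, then $d^*_{i\ell} = d$. Therefore, in order
to recover $d$, it suffices to apply Lemma 2.3 recursively in order to recover $d^*_{i\ell}$, for
$i = 0, \ldots, \lceil q/\ell \rceil$.

It is trivial to recover $d^*_0(v, u) = 2 \cdot 1_{u \ne v}$. We now show how, given $d^*_{i\ell}$ and the
samples $(\sigma^t_{\partial T})_{t=1}^{k}$, we can recover $d^*_{i\ell + \ell}$ with error probability bounded by $n^2 \exp(-c^* k)/b^{2i\ell}$.

Let $\Psi : \{\pm 1\}^{\partial T} \to \{\pm 1\}^{L_{q - i\ell}}$ be the function defined in the second part of Lemma
2.3 given $d^*_{i\ell}$.

Then

$(\Psi(\sigma^t_{\partial T}))_{t=1}^{k} = (\sigma^t_v \tau^t_v : v \in L_{q - i\ell})_{t=1}^{k}$,
where $\tau^t_v$ are i.i.d. variables with $E[\tau^t_v] \ge \eta_0$. Moreover, $\tau^t_v$ are independent of
$(\sigma^t_v : d(v, \partial T) \ge i\ell,\ 1 \le t \le k)$.

By the first part of the lemma, given $(\sigma^t_v \tau^t_v : v \in L_{q - i\ell})_{t=1}^{k}$, we may recover

$d' : L_{q - i\ell} \times L_{q - i\ell} \to \{0, \ldots, 2\ell + 2\}$,

defined by $d'(u, v) = \min\{d(u, v), 2\ell + 2\}$, with error probability bounded by $n^2 \exp(-c^* k)/b^{2i\ell}$.
Note that

(6) $d^*_{i\ell + \ell}(u, v) = \begin{cases} d^*_{i\ell}(u, v) & \text{if } d^*_{i\ell}(u, v) \le 2i\ell, \\ d'(u', v') + 2i\ell & \text{if } u \in \partial T(u'),\ v \in \partial T(v'),\ \{u', v'\} \subset L_{q - i\ell},\ u' \ne v'. \end{cases}$

Thus given $d^*_{i\ell}$, by recovering $d'$, we may recover $d^*_{i\ell + \ell}$.

Let $A_i$ be the event of error in recovering $d^*_{i\ell + \ell}$ given $d^*_{i\ell}$ and $\alpha = \sum_{i=0}^{\lceil q/\ell \rceil} P[A_i]$.
Then the probability of error in the recursive scheme above is bounded by $\alpha$ and

$\alpha \le \exp(-c^* k)\left(n^2 + n^2/b^{2\ell} + n^2/b^{4\ell} + \cdots\right) \le 2n^2 \exp(-c^* k)$.
Defining $c' = 1/c^*$, and taking

$k = c'\left(\log(2n^2) - \log \delta\right) = c'(2 \log n + \log 2 - \log \delta)$,
we obtain $\alpha \le \delta$. The statement of the theorem follows by letting $c_\theta = 3c'$. ■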
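
To make the dependence of $k$ on the parameters concrete, the following sketch evaluates the sample size chosen at the end of the proof, with $c^* = \eta_0^4 \theta^{4\ell} (1 - \theta^2)^2 / 8$ as in the bound on the correlation estimates. The function names and the numerical inputs are illustrative, and the values of $\eta_0$ and $\ell$ are simply assumed, not derived from Lemma 2.1.

```python
import math

def c_star(theta, eta0, ell):
    """c* = eta0^4 * theta^(4*ell) * (1 - theta^2)^2 / 8."""
    return eta0 ** 4 * theta ** (4 * ell) * (1 - theta ** 2) ** 2 / 8

def samples_needed(n, delta, theta, eta0, ell):
    """Smallest integer k with k >= (log(2 n^2) - log(delta)) / c*."""
    return math.ceil((math.log(2 * n * n) - math.log(delta))
                     / c_star(theta, eta0, ell))

def error_bound(n, k, theta, eta0, ell):
    """The union bound 2 n^2 exp(-c* k) on the total error probability."""
    return 2 * n * n * math.exp(-c_star(theta, eta0, ell) * k)

k = samples_needed(1024, 0.05, 0.9, 0.5, 2)
```

The number of samples grows linearly in $\log n$ and $\log(1/\delta)$, while the constant $1/c^*$ deteriorates quickly as $\theta \to 1$ or as $\ell$ grows.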






3. Polynomial lower bound for $b\theta^2 < 1$

In this section we prove Theorem 1.4 via an entropy argument. Let $X$ and $Y$ be
discrete random variables. Recall the definitions of the entropy of $X$, $H(X)$, the
conditional entropy of $X$ given $Y$, $H(X|Y)$, and the mutual information of $X$ and
$Y$, $I(X, Y)$:

$H(X) = -\sum_x P[X = x] \log_2 P[X = x]$,

$H(X|Y) = E_y H(X|Y = y) = H(X, Y) - H(Y)$,

$I(X, Y) = H(X) + H(Y) - H(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$.

(see e.g. [S] for basic properties of $H$ and $I$).
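
These quantities are straightforward to compute from a finite joint distribution. The sketch below does so for a joint probability table given as a dictionary keyed by pairs $(x, y)$; the function names and this input format are illustrative choices.

```python
import math
from collections import defaultdict

def entropy(pmf):
    """H = -sum p log2 p over the outcomes with positive probability."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginals(joint):
    """Marginal distributions of X and Y from a table {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return px, py

def mutual_information(joint):
    """I(X, Y) = H(X) + H(Y) - H(X, Y)."""
    px, py = marginals(joint)
    return entropy(px) + entropy(py) - entropy(joint)

def conditional_entropy(joint):
    """H(X | Y) = H(X, Y) - H(Y)."""
    _, py = marginals(joint)
    return entropy(joint) - entropy(py)

copy = {(-1, -1): 0.5, (1, 1): 0.5}                              # Y = X, fair bit
independent = {(x, y): 0.25 for x in (-1, 1) for y in (-1, 1)}   # two fair bits
```

For a perfect copy of a fair bit the mutual information is one bit, while for independent bits it is zero.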

The core of the proof of Theorem 1.4 is the fact that for the $q$-level $b$-ary tree,
if $b\theta^2 < 1$, then the correlation between the color at the root and the coloring of
the boundary of the tree decays exponentially in q. We will utilize the following 
formulation from ^H! (see also |H[T7j\ 
Lemma 3.1 ([10]). Let $\sigma$ be a sample of the CFN process on the $q$-level $b$-ary tree.
Then

$I(\sigma_\rho, \sigma_{\partial T}) \le b^q \theta^{2q}$.

We will also use some basic properties of $I$, see e.g. 0.

Lemma 3.2. Let $X$, $Y$ and $Z$ be random variables such that $X$ and $Z$ are
independent given $Y$. Then

(7) $I(X, Z) \le \min\{I(X, Y), I(Y, Z)\}$ ("Data Processing Lemma"),

(8) $I((X, Y), Z) = I(Y, Z)$,

(9) $I((X, Z), Y) \le I(X, Y) + I(Z, Y)$.

It is well known that if $I(X, Y)$ is small, then it is hard to reconstruct $X$ given
$Y$.

Lemma 3.3 (Fano's inequality). Let $X$ and $Y$ be random variables s.t. $X$ takes
values in a set $A$ of size $m$, $Y$ takes values in a set $B$, and

(10) $\Delta = \Delta(X, Y) = \sup_f P[f(Y) = X]$

is the probability of reconstructing the value of $X$ given $Y$ (the sup is taken over all
randomized functions). Then

(11) $H(\Delta) + (1 - \Delta) \log_2(m - 1) \ge H(X|Y)$,
where $H(\Delta) = -\Delta \log_2 \Delta - (1 - \Delta) \log_2(1 - \Delta)$.

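It is helpful to have the following easy formula.
Lemma 3.4. The number of topologies for $\ell$-level $b$-ary trees on $b^\ell$ labeled leaves,
$n_{top}(\ell)$, is

(12) $n_{top}(\ell) = \frac{(b^\ell)!}{(b!)^{(b^\ell - 1)/(b - 1)}}$.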
In particular, if £ > b^ , then 

(13) \ogntop{£)>b''Hog{b'). 
Proof. Clearly $n_{top}(1) = 1$, and

$n_{top}(\ell) = \frac{n_{top}(\ell - 1)^b}{b!} \binom{b^\ell}{b^{\ell - 1} \cdots b^{\ell - 1}}$.

We therefore obtain (12) by induction. To obtain (13), we note that by Stirling's
formula, for £ > 6'^, 

hi _ 1 

\og{ntop{£)) = log(6^!)-log(fe!)^&^ >6Mog(60-&'--^log(6!) 
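
The counting recursion can be checked against a closed form with exact integer arithmetic. The sketch assumes the recursion $n_{top}(\ell) = n_{top}(\ell - 1)^b \binom{b^\ell}{b^{\ell-1} \cdots b^{\ell-1}} / b!$ with $n_{top}(1) = 1$, and the closed form $(b^\ell)! / (b!)^{(b^\ell - 1)/(b - 1)}$ that results from unrolling it; the function names are illustrative.

```python
from math import factorial

def n_top_recursive(b, ell):
    """Split the b^ell leaves into b unordered blocks of size b^(ell-1),
    then choose a topology inside each block."""
    if ell == 1:
        return 1
    prev = n_top_recursive(b, ell - 1)
    multinomial = factorial(b ** ell) // factorial(b ** (ell - 1)) ** b
    return prev ** b * multinomial // factorial(b)

def n_top_closed(b, ell):
    """Closed form (b^ell)! / (b!)^((b^ell - 1) / (b - 1))."""
    exponent = (b ** ell - 1) // (b - 1)   # number of internal nodes
    return factorial(b ** ell) // factorial(b) ** exponent
```

For $b = 2$ and $\ell = 2$ both expressions give 3, the number of ways to split four labeled leaves into two sibling pairs.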

Proof of Theorem 1.4. We first note that given the topology of a tree $T \in
T_b((b+1)b^q)$, the root of $T$ is uniquely determined as the unique vertex which has
the same distance to all the leaves. Therefore if $T, T' \in T_b((b+1)b^q)$ have the same
topology, then $T$ and $T'$ are isomorphic as rooted trees, i.e., there exists a graph
isomorphism $\psi$ of $T$ onto $T'$ which maps leaves to leaves of the same label and
the root of $T$ to the root of $T'$. In the proof below we won't distinguish between
the topology of $T$ and $T$.

Assuming $b\theta^2 < 1$, we want to prove a lower bound on the number of samples
needed in order to reconstruct the tree, given that the tree is chosen uniformly at
random. Clearly, the probability of reconstruction increases if in addition to the
samples we are given additional information. We will assume that we are given
the $(q - \ell + 1)$-topology of the tree, i.e., for all $u, v \in \partial T$, we are given $d^*(u, v) =
\min\{d(u, v), 2(q - \ell) + 4\}$; the value of $\ell < q$ will be specified later.

Given $d^*$, we have the $(q - \ell + 1)$-labeling of the nodes of $L_\ell = \{v : d(v, \rho) =
\ell\}$. Therefore, the reconstruction problem reduces to reconstructing an element of
$T_b((b+1)b^{\ell - 1})$ on the set of labeled leaves $\{\partial T(v) : v \in L_\ell\}$. Moreover, it is easy
to see that given $d^*$, the conditional distribution over $T_b((b+1)b^{\ell - 1})$ is uniform.

Recall that $\sigma^t_\partial = (\sigma^t_v : v \in \partial T)$. We may generate $(\sigma^t_\partial)_{t=1}^{k}$ in the following
manner:

• Choose $T \in T_b((b+1)b^{\ell - 1})$ uniformly at random.

• Given $T$, generate the samples at level $\ell$, i.e., $\sigma^t_\ell = (\sigma^t_v : v \in L_\ell)$, for
$1 \le t \le k$.

• Given $(\sigma^t_\ell : 1 \le t \le k)$, generate $(\sigma^t_\partial)_{t=1}^{k}$.

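We conclude that $T$ and $(\sigma^t_\partial)_{t=1}^{k}$ are conditionally independent given $(\sigma^t_\ell)_{t=1}^{k}$.
In particular, by the Data Processing Lemma (7),

(14) $I\left(T, (\sigma^t_\partial)_{t=1}^{k}\right) \le I\left((\sigma^t_\ell)_{t=1}^{k}, (\sigma^t_\partial)_{t=1}^{k}\right)$.

Since $\sigma^t$ and $\sigma^{t'}$ are independent for $t \ne t'$,

(15) $I\left((\sigma^t_\ell)_{t=1}^{k}, (\sigma^t_\partial)_{t=1}^{k}\right) = \sum_{t=1}^{k} I(\sigma^t_\ell, \sigma^t_\partial)$.

Let $\sigma^t_{\partial T(v)}$ be the coloring of the set $\partial T(v)$. Note that $(\sigma^t_{\partial T(v)} : v \in L_\ell)$ are
conditionally independent given $\sigma^t_\ell$. Therefore, by (9),

(16) $I(\sigma^t_\ell, \sigma^t_\partial) \le \sum_{v \in L_\ell} I(\sigma^t_\ell, \sigma^t_{\partial T(v)})$.

Finally note that $(\sigma^t_w : w \in L_\ell, w \ne v)$ are independent of $\sigma^t_{\partial T(v)}$ given $\sigma^t_v$, and
therefore for all $v \in L_\ell$,

(17) $I(\sigma^t_\ell, \sigma^t_{\partial T(v)}) = I(\sigma^t_v, \sigma^t_{\partial T(v)})$.

Combining (14), (15), (16) and (17) we obtain

$I\left(T, (\sigma^t_\partial)_{t=1}^{k}\right) \le \sum_{t=1}^{k} \sum_{v \in L_\ell} I(\sigma^t_v, \sigma^t_{\partial T(v)})$.

By Lemma 3.1, $I(\sigma^t_v, \sigma^t_{\partial T(v)}) \le b^{q - \ell + 1} \theta^{2(q - \ell + 1)} \le b^{q - \ell} \theta^{2(q - \ell)}$. Therefore

$I\left(T, (\sigma^t_\partial)_{t=1}^{k}\right) \le k(b+1) b^q \theta^{2(q - \ell)}$.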

Letting $m'$ be the number of topologies for trees in $T_b((b+1)b^{\ell - 1})$ we see that

$H\left(T \mid (\sigma^t_\partial)_{t=1}^{k}\right) = H(T) - I\left(T, (\sigma^t_\partial)_{t=1}^{k}\right) \ge \log_2 m' - k(b+1) b^q \theta^{2(q - \ell)}$.

By Lemma 3.3 we conclude that the probability $\Delta = \Delta\left(T, (\sigma^t_\partial)_{t=1}^{k}\right)$ of
reconstructing $T$ given $(\sigma^t_\partial)_{t=1}^{k}$ satisfies

(18) $H(\Delta) + (1 - \Delta) \log_2 m' \ge \log_2 m' - k(b+1) b^q \theta^{2(q - \ell)}$.

The rest of the proof consists of calculations showing how to derive (4) from (18).
Clearly $m'$ is at least $m = n_{top}(\ell)$. Rewriting (18), we obtain

$H(\Delta) + k(b+1) b^q \theta^{2(q - \ell)} \ge \Delta \log_2 m' \ge \Delta \log_2 m$,
from which we conclude that

(19) $\Delta \le \max\left\{\frac{2H(\Delta)}{\log_2 m}, \frac{2k(b+1) b^q \theta^{2(q - \ell)}}{\log_2 m}\right\}$.

Note that $-(1 - x)\log(1 - x) \le x$ for $x \in [0, 1]$, and therefore $H(\Delta) \le -\Delta \log_2 \Delta +
\Delta \log_2(e)$. Thus if $\Delta \le 2H(\Delta)/\log_2 m$, then $0.5\,\Delta \log_2(m) \le -\Delta \log_2 \Delta + \Delta \log_2(e)$,
or $\Delta \le e/\sqrt{m}$. So by (19), we obtain

(20) $\Delta \le \max\left\{\frac{e}{\sqrt{m}}, \frac{2k(b+1) b^q \theta^{2(q - \ell)}}{\log_2 m}\right\}$.

Therefore, if ^ > fe^ (say) then by Lemma [3.41 

A < max (exp (l - ^.hb'-^ogb') , 2fcfc(& + 1)^ )^'M < max{exp(-6^+i), ^(W^)?-^}. 
I logsfc*^ J 

We now take $\ell = \lfloor \log_b q + \log_b(-\log b\theta^2) \rfloor$, so $\exp(-b^{\ell + 1}) \le (b\theta^2)^q$. Since we have
the freedom of choosing we conclude that

A < k{b9^Y^'^°^'''^^'^°^''^^^°^^^^\ 

for large q as needed. ■ 

4. Majority on trees

In this section we analyze the behavior of the majority algorithm on balanced
$b$-ary trees. Theorem 4.1 will be used later in the proof of Theorems 1.5 and 1.6.

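Definition 4.1. Let $T = (V, E)$ be a tree rooted at $\rho$ with boundary $\partial T$. For
functions $\theta' : E \to [0, 1]$ and $\eta' : \partial T \to [0, 1]$, let $CFN(\theta', \eta')$ be the CFN model on
$T$ where

• $\theta(e) = \theta'(e)$ for all $e$ which is not adjacent to $\partial T$, and

• $\theta(e) = \theta'(e)\eta'(v)$ for all $e = (u, v)$, with $v \in \partial T$.

Let

$\widehat{\mathrm{Maj}}(\theta', \eta') = E[+\mathrm{Maj}(\sigma_{\partial T}) \mid \sigma_\rho = +1] = E[-\mathrm{Maj}(\sigma_{\partial T}) \mid \sigma_\rho = -1]$,
where $\sigma$ is drawn according to $CFN(\theta', \eta')$.

For functions $\theta$ and $\eta$ as above we'll abbreviate by writing $\min \theta$ for $\min_E \theta(e)$,
$\max \eta$ for $\max_{v \in \partial T} \eta(v)$, etc. The function $\widehat{\mathrm{Maj}}$ measures how well the majority
calculates the color at the root of the tree.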

Theorem 4.1. Let

(21) $a(d) = 2^{1 - d} \left\lceil \frac{d}{2} \right\rceil \binom{d}{\lceil d/2 \rceil}$.

For all integers $\ell$, $\theta_{\min} \in [0, 1]$ and $0 < \alpha < a(b^\ell)\theta_{\min}^\ell$, there exists $\beta = \beta(b, \ell, \theta_{\min}, \alpha) > 0$
such that the following hold. Let $T$ be an $\ell$-level balanced $b$-ary tree, and consider
the $CFN(\theta, \eta)$ model on $T$, where $\min \theta \ge \theta_{\min}$ and $\min \eta \ge \eta_{\min}$. Then

(22) $\widehat{\mathrm{Maj}}(\theta, \eta) \ge \min\{\alpha \eta_{\min}, \beta\}$.

In particular, given $b$ and $\theta_{\min}$ such that $b\theta_{\min}^2 > g^2 > 0$, there exist $\ell(b, \theta_{\min})$,
$\alpha(b, \theta_{\min}) > g^\ell$ and $\beta(b, \theta_{\min}) > 0$, such that any $CFN(\theta, \eta)$ model on the $\ell$-level
$b$-ary tree satisfying $\min \theta \ge \theta_{\min}$ and $\min \eta \ge \eta_{\min}$ must also satisfy (22).

Theorem 4.1 is a generalization of Lemma 2.1.

Proof of Lemma 2.1 [23]: Follows immediately from the second assertion in
Theorem 4.1 where $g = 1$, and $\eta_0$ (of Lemma 2.1) is chosen between 0 and $\beta$ (of
Theorem 4.1). ■

The following lemma is a generalization of the second claim in Lemma 2.3.

Lemma 4.2. Let $b$ and $\theta_{\min}$ be such that $b\theta_{\min}^2 > g^2 > 0$. Let $\ell(b, \theta_{\min})$, $\alpha(b, \theta_{\min}) >
g^\ell$ and $\beta(b, \theta_{\min}) > 0$ be such that (22) holds.

Consider the $CFN(\theta, \eta)$ model on the family of balanced trees of $q$ levels, where
all internal nodes have at least $b$ children, a total of $n$ leaves, $\min \theta \ge \theta_{\min}$ and
$\min \eta \ge \beta$.

Then for all $T$ and $i \ge 0$, there exists a map $\Psi = \Psi^i_T : \{\pm 1\}^{\partial T} \to \{\pm 1\}^{L_{q - i\ell}}$ for
which the following hold.

If $\sigma$ is distributed according to the $CFN(\theta, \eta)$ model on $T$, and $\sigma' = \Psi(\sigma_{\partial T})$,
then $(\sigma'_v)_{v \in L_{q - i\ell}} = (\sigma_v \tau_v)_{v \in L_{q - i\ell}}$, where $\tau_v$ are independent variables. Moreover,
$\tau_v$ are independent of $(\sigma_v)_{d(v, \partial T) \ge i\ell}$ and satisfy $E[\tau_v] \ge \min\{1, g^{i\ell}\}\beta$ for all $v$.

The map $\Psi$ may be constructed from the $(i\ell)$-topology of $T$. In particular, if
$T_1$ and $T_2$ have the same $(i\ell)$-topology, then $\Psi_{T_1} = \Psi_{T_2}$. Furthermore, $\Psi(\sigma_{\partial T})$ is
computable in time polynomial in $n$.

Proof. Similarly to Lemma 2.3, the proof is by induction on $i$. The claim is trivial
for $i = 0$.

For the induction step, suppose that we are given $d^*_{(i+1)\ell}$. Label all the vertices
$v \in \bigcup_{j=0}^{(i+1)\ell} L_{q-j}$ by the $(i+1)\ell$-labeling of $T$.

By the induction hypothesis, there exists a map $\Psi'$ such that $\Psi'(\sigma_{\partial T}) = (\sigma_v \tau_v)_{v \in L_{q-i\ell}}$,
where the $\tau_v$ are independent ±1 variables, independent of $(\sigma_v)_{d(v,\partial T) \ge i\ell}$, and satisfy
$E[\tau_v] \ge \min\{1, g^{i\ell}\}\beta$ for all $v$.

By the properties of the labeling, for each $w \in L_{q-i\ell-\ell}$ we can find a set
$R(w) \subset L_{q-i\ell}$, which is the set of leaves of an $\ell$-level $b$-ary tree rooted at $w$.

We now let $\Psi(\sigma_{\partial T}) = (\sigma'_w)_{w \in L_{q-i\ell-\ell}}$, where

$$\sigma'_w = \mathrm{Maj}\big((\Psi'(\sigma_{\partial T}))_v : v \in R(w)\big) = \mathrm{Maj}(\sigma_v \tau_v : v \in R(w)).$$

By Theorem 4.1 it follows that $(\sigma'_w)_{w \in L_{q-i\ell-\ell}} = (\sigma_w \tau_w)_{w \in L_{q-i\ell-\ell}}$, where the $\tau_w$ are
independent ±1 variables. Moreover, the $\tau_w$ are independent of $(\sigma_v)_{d(v,\partial T) \ge i\ell+\ell}$ and
satisfy $E[\tau_w] \ge \min\{1, g^{(i+1)\ell}\}\beta$ for all $w$, proving the second claim.

Note that in order to compute the function $\Psi$ one applies the majority function
recursively starting at a subset of the leaves. Therefore $\Psi$ is computable in time
polynomial in $n$. ■
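The recursive use of the majority function described in the proof can be sketched in Python. This is only an illustration under simplifying assumptions that are not in the text: the tree is a complete $b$-ary tree, the colors at a level are stored in a list in which the leaves of each $\ell$-level subtree are contiguous, and ties are broken toward +1. The names `majority` and `recursive_majority` are hypothetical.

```python
def majority(values):
    """Sign of the sum of +/-1 values; ties go to +1 (an assumption of this sketch)."""
    return 1 if sum(values) >= 0 else -1


def recursive_majority(leaf_colors, b, ell, i):
    """Apply block majority i times.

    Each round replaces every run of b**ell consecutive entries (assumed to be
    the leaves of one ell-level subtree) by the majority of that run.
    """
    block = b ** ell
    colors = list(leaf_colors)
    for _ in range(i):
        if len(colors) % block != 0:
            raise ValueError("level size is not a multiple of b**ell")
        colors = [majority(colors[j:j + block])
                  for j in range(0, len(colors), block)]
    return colors
```

Each round touches every entry once, so the total work is linear in the number of leaves, in line with the polynomial-time claim.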

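The following lemma shows why the second assertion of Theorem 4.1 follows
from the first one.

Lemma 4.3.

$$\lim_{d \to \infty} \frac{a(d)}{\sqrt{d}} = \sqrt{\frac{2}{\pi}}.$$

In particular, if $b\theta^2 > g^2$, then $a(b^\ell)\theta^\ell > g^\ell$ for all sufficiently large $\ell$.

Proof. Stirling's formula implies that

$$a(d) = 2^{1-d} \left\lceil \tfrac{d}{2} \right\rceil \binom{d}{\lceil d/2 \rceil} = (1 + o(1)) \sqrt{\frac{2}{\pi}} \sqrt{d}.$$

Now the claim follows. ■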
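As a numerical sanity check of the asymptotics, the following Python sketch evaluates $a(d)$ from (21), compares it with $\sqrt{2d/\pi}$, and searches for the smallest $\ell$ with $a(b^\ell)\theta^\ell > g^\ell$. The function names and the search cap `max_ell` are illustrative assumptions, not part of the paper.

```python
from math import comb, pi, sqrt


def a(d):
    """a(d) = 2^(1-d) * ceil(d/2) * C(d, ceil(d/2)), as in (21)."""
    half = (d + 1) // 2  # ceil(d/2)
    return half * comb(d, half) / 2 ** (d - 1)


def stirling_ratio(d):
    """Ratio a(d) / sqrt(2 d / pi); it tends to 1 as d grows."""
    return a(d) / sqrt(2 * d / pi)


def smallest_ell(b, theta, g, max_ell=12):
    """Smallest ell <= max_ell with a(b^ell) * theta^ell > g^ell, else None."""
    for ell in range(1, max_ell + 1):
        if a(b ** ell) * theta ** ell > g ** ell:
            return ell
    return None
```

For example, with $b = 2$, $\theta = 0.8$ and $g = 1$ (so $b\theta^2 = 1.28 > g^2$), the values $a(2) = 1$, $a(4) = 1.5$ and $a(8) = 2.1875$ show that the inequality first holds at $\ell = 3$.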

The role that $a(d)$ plays for the majority algorithm is presented in the following
lemma.

Lemma 4.4. (1) Let $X, Y_1, \ldots, Y_d$ be a sequence of ±1 random variables such
that $Y_2, \ldots, Y_d$ are i.i.d. with $E[Y_i] = 0$, $E[Y_1 \mid X = 1] = -E[Y_1 \mid X = -1] = \theta$,
and $Y_2, \ldots, Y_d$ are independent of $X, Y_1$. Then

$$E[X\,\mathrm{Maj}(Y_1, \ldots, Y_d) \mid X = 1] = E[X\,\mathrm{Maj}(Y_1, \ldots, Y_d) \mid X = -1] = \theta\,\frac{a(d)}{d}, \tag{23}$$

where $a(d)$ is given in (21).
(2) Let $X, Y_1, \ldots, Y_{d-1}$ be a collection of random variables, where $X$ is non-negative,
$Y_1, \ldots, Y_{d-1}$ are symmetric and

• $Y_1, \ldots, Y_{d-1}$ are independent, and

• $P[X \ge \max_i |Y_i|] = 1$.
Then

$$E\Big[\mathrm{sign}\Big(X + \sum_{i=1}^{d-1} Y_i\Big)\Big] \ge \frac{a(d)}{d}. \tag{24}$$

Proof. (1) Let $\tilde Y$ be a ±1 variable which is independent of $X, Y_2, \ldots, Y_d$, with
$E[\tilde Y] = 0$. Let $Z$ be a random variable which is independent of $X, Y_2, \ldots, Y_d, \tilde Y$,
such that $P[Z = 1] = \theta$ and $P[Z = 0] = 1 - \theta$. Note that $(Y_1, \ldots, Y_d)$ and
$(ZX + (1 - Z)\tilde Y, Y_2, \ldots, Y_d)$ have the same distribution. Therefore,

$$E[X\,\mathrm{Maj}(Y_1, \ldots, Y_d) \mid X = 1] = \theta\,E[X\,\mathrm{Maj}(X, Y_2, \ldots, Y_d) \mid X = 1] + (1 - \theta)\,E[X\,\mathrm{Maj}(\tilde Y, Y_2, \ldots, Y_d) \mid X = 1]. \tag{25}$$

Since $\tilde Y, Y_2, \ldots, Y_d$ are independent of $X$, it follows that

$$E[X\,\mathrm{Maj}(\tilde Y, Y_2, \ldots, Y_d) \mid X = 1] = 0.$$

Therefore by (25), in order to prove the lemma, it suffices to show that if
$Y_2, \ldots, Y_d$ are i.i.d. ±1 random variables with $E[Y_i] = 0$, then

$$E[\mathrm{Maj}(1, Y_2, \ldots, Y_d)] = \frac{a(d)}{d}. \tag{26}$$

It is helpful to note that $E[\mathrm{Maj}(1, Y_2, \ldots, Y_d)] = P[\mathrm{sign}(1 + \sum_{i=2}^{d} Y_i) \ne \mathrm{sign}(\sum_{i=2}^{d} Y_i)]$.
There are two cases to consider:

• $d = 2e + 1$ is odd:

$$P\Big[\mathrm{sign}\Big(1 + \sum_{i=2}^{d} Y_i\Big) \ne \mathrm{sign}\Big(\sum_{i=2}^{d} Y_i\Big)\Big] = P\Big[\sum_{i=2}^{d} Y_i = 0\Big] = 2^{-2e}\binom{2e}{e} = \frac{a(d)}{d}.$$

• $d = 2e$ is even:

$$P\Big[\mathrm{sign}\Big(1 + \sum_{i=2}^{d} Y_i\Big) \ne \mathrm{sign}\Big(\sum_{i=2}^{d} Y_i\Big)\Big] = P\Big[\sum_{i=2}^{d} Y_i = -1\Big] = 2^{-2e+1}\binom{2e-1}{e} = \frac{a(d)}{d}.$$
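Identity (23) can be checked exactly for small $d$ by enumerating all sign patterns. The Python sketch below does this with rational arithmetic. It assumes that a tied vote contributes 0 to the expectation (equivalently, that ties are broken by a fair coin); this convention is an assumption of the sketch, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb


def a(d):
    """a(d) from (21), as an exact rational number."""
    half = (d + 1) // 2  # ceil(d/2)
    return Fraction(half * comb(d, half), 2 ** (d - 1))


def sign(s):
    """Sign of s, with sign(0) = 0 (tie convention assumed in this sketch)."""
    return (s > 0) - (s < 0)


def expected_majority(d, theta):
    """Exact E[Maj(Y_1..Y_d) | X = 1] when Y_1 = +1 with probability
    (1 + theta)/2 and Y_2..Y_d are independent fair +/-1 variables."""
    p_plus = (1 + theta) / 2
    weight_rest = Fraction(1, 2 ** (d - 1))
    total = Fraction(0)
    for y1 in (1, -1):
        w1 = p_plus if y1 == 1 else 1 - p_plus
        for rest in product((1, -1), repeat=d - 1):
            total += w1 * weight_rest * sign(y1 + sum(rest))
    return total
```

For instance, `expected_majority(3, Fraction(3, 10))` can be compared with `Fraction(3, 10) * a(3) / 3`; the enumeration costs $2^{d}$ steps, so it is only meant for small $d$.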

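(2) The general case follows by conditioning from the case where for all $i$, $Y_i$ is
a $\pm y_i$ random variable, $X = x$ is a constant and $x \ge \max_{1 \le i \le d-1} |y_i|$.

If $x = |y_1| = \ldots = |y_{d-1}|$ then the claim follows from the proof of the
first part of the lemma. We may therefore assume the strict inequality
$x > |y_{d-1}|$.

We now show that it suffices to prove the claim for $x, y_1, \ldots, y_{d-1}$ such
that

$$P\Big[x + \sum_{i=1}^{d-1} Y_i = 0\Big] = 0. \tag{27}$$

Indeed, if $P[x + \sum_{i=1}^{d-1} Y_i = 0] > 0$, let

$$\epsilon = \frac{1}{2} \min\Big\{ \Big|c_0 x + \sum_{i=1}^{d-1} c_i y_i\Big| : c_0 x + \sum_{i=1}^{d-1} c_i y_i \ne 0 \text{ and } c_i \in \{-1, 0, 1\} \text{ for all } i \Big\}.$$

Define $y_i^+, y_i^-$ for $1 \le i \le d-1$ by

$$y_i^{\pm} = \begin{cases} y_i & \text{if } i < d-1, \\ y_i \pm \epsilon & \text{if } i = d-1. \end{cases}$$

Let $Y_i^+$ be independent symmetric $\pm y_i^+$ variables and let $Y_i^-$ be independent
symmetric $\pm y_i^-$ variables. Note that

$$E\Big[\mathrm{sign}\Big(x + \sum_{i=1}^{d-1} Y_i\Big)\Big] = \frac{1}{2} E\Big[\mathrm{sign}\Big(x + \sum_{i=1}^{d-1} Y_i^+\Big)\Big] + \frac{1}{2} E\Big[\mathrm{sign}\Big(x + \sum_{i=1}^{d-1} Y_i^-\Big)\Big], \tag{28}$$

$x \ge \max_{1 \le i \le d-1} |y_i^+|$, $x \ge \max_{1 \le i \le d-1} |y_i^-|$, and

$$P\Big[x + \sum_{i=1}^{d-1} Y_i^+ = 0\Big] = P\Big[x + \sum_{i=1}^{d-1} Y_i^- = 0\Big] = 0. \tag{29}$$

From (28) and (29) it follows that we may assume (27).
By (27) and symmetry

$$E\Big[\mathrm{sign}\Big(x + \sum_{i=1}^{d-1} Y_i\Big)\Big] = 1 - 2P\Big[\sum_{i=1}^{d-1} Y_i < -x\Big] = 1 - P\Big[\sum_{i=1}^{d-1} Y_i < -x\Big] - P\Big[\sum_{i=1}^{d-1} Y_i > x\Big]. \tag{30}$$

Let $U_-, U_+ \subset \{-1, 1\}^{d-1}$ be defined as $U_- = \{(b_1, \ldots, b_{d-1}) : \sum b_i y_i < -x\}$
and $U_+ = \{(b_1, \ldots, b_{d-1}) : \sum b_i y_i > x\}$. Rewriting (30), we get

$$E\Big[\mathrm{sign}\Big(x + \sum_{i=1}^{d-1} Y_i\Big)\Big] = 1 - P[U_-] - P[U_+], \tag{31}$$

where $P$ is the uniform measure on $\{-1, 1\}^{d-1}$. Note that if $h$ denotes the
Hamming distance, then

$$h(U_+, U_-) := \min_{b_+ \in U_+,\ b_- \in U_-} h(b_+, b_-) \ge 2.$$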

By the isoperimetric inequality for the discrete cube ^21 (see ^3]; |3] for
background) it follows that maximizers of the following quantity

$$\max\{p : P[U] = P[U'] = p;\ U, U' \subset \{-1, 1\}^{d-1} \text{ and } h(U, U') \ge 2\}$$

are obtained as follows.

• If $d = 2e + 1$ is odd, then the maximum is obtained for

$$U = \Big\{b : \sum_{i=1}^{d-1} b_i \ge 2\Big\}, \qquad U' = \Big\{b : \sum_{i=1}^{d-1} b_i \le -2\Big\}. \tag{33}$$

Therefore

$$1 - P[U_+] - P[U_-] \ge 1 - P[U] - P[U'] = 2^{-2e}\binom{2e}{e} = \frac{a(d)}{d}. \tag{32}$$

• If $d = 2e$ is even, then the maximum is obtained for

$$U = \Big\{b : \sum_{i=1}^{d-1} b_i \ge 3\Big\} \cup \Big\{b : \sum_{i=1}^{d-1} b_i = 1 \text{ and } b_1 = b_2 = 1\Big\},$$

$$U' = \Big\{b : \sum_{i=1}^{d-1} b_i \le -3\Big\} \cup \Big\{b : \sum_{i=1}^{d-1} b_i = -1 \text{ and } b_1 = b_2 = -1\Big\}.$$

Therefore in this case

$$1 - P[U] - P[U'] = 2 \times 2^{-2e+1}\left(\binom{2e-1}{e} - \binom{2e-3}{e-1}\right) > 2^{-2e+1}\binom{2e-1}{e} = \frac{a(d)}{d}$$

(the assumption $|y_{d-1}| < x$ resulted in a better bound here than in the ±1
case).

Now (ED follows from and (P)). 

■ 
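Part 2 of Lemma 4.4 can also be explored numerically. The following Python sketch computes $E[\mathrm{sign}(x + \sum_i Y_i)]$ exactly for fixed magnitudes $y_1, \ldots, y_{d-1}$ and a constant $x \ge \max_i |y_i|$, so that it can be compared with $a(d)/d$. It uses the convention $\mathrm{sign}(0) = 0$, which is an assumption of the sketch (it matches the averaging step (28)); the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb


def a(d):
    """a(d) from (21), as an exact rational number."""
    half = (d + 1) // 2  # ceil(d/2)
    return Fraction(half * comb(d, half), 2 ** (d - 1))


def sign(s):
    """Sign of s, with sign(0) = 0 (convention assumed in this sketch)."""
    return (s > 0) - (s < 0)


def expected_sign(x, ys):
    """Exact E[sign(x + sum_i Y_i)] where Y_i = +/- ys[i] with probability 1/2 each."""
    total = 0
    for signs in product((1, -1), repeat=len(ys)):
        total += sign(x + sum(s * y for s, y in zip(signs, ys)))
    return Fraction(total, 2 ** len(ys))
```

With `x = 1` and `ys = [1, 1]` (so $d = 3$) the bound (24) is attained with equality, while larger values of `x` or unequal magnitudes give strictly larger expectations in the cases tried in the test.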

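Lemma 4.5. Let $T$ be the $\ell$-level $b$-ary tree. Suppose that $\alpha < a(b^\ell)\theta_{\min}^{\ell}$. Then
there exists $\epsilon = \epsilon(b, \theta_{\min}, \alpha) > 0$, s.t. if $\max\eta < \epsilon$ and $\min\theta \ge \theta_{\min}$, then the
CFN$(\theta,\eta)$ model on $T$ satisfies

$$\mathrm{Maj}(\theta,\eta) \ge \alpha \min\eta. \tag{34}$$

Proof. In order to prove (34) it suffices to show that for $\alpha < a(b^\ell)\theta_{\min}^{\ell}$, there exists
$\epsilon > 0$, s.t. if $\max\eta < \epsilon$, then for all $v$

$$\frac{\partial\,\mathrm{Maj}}{\partial \eta(v)}(\theta,\eta) \ge \frac{\alpha}{b^\ell}. \tag{35}$$

Note that $\mathrm{Maj}(\theta,\eta)$ is a polynomial in $\theta$ and $\eta$. Therefore all the derivatives of
$\mathrm{Maj}(\theta,\eta)$ with respect to the $\theta$ and $\eta$ variables are uniformly bounded. In particular,
there exists a constant $C$ (which depends on $b$ and $\ell$ only) such that for all $\theta$ and
$\eta$ we have that

$$\Big|\frac{\partial\,\mathrm{Maj}}{\partial \eta(v)}(\theta,\eta) - \frac{\partial\,\mathrm{Maj}}{\partial \eta(v)}(\theta,0)\Big| \le C \max_v \eta(v). \tag{36}$$

Therefore it suffices to show that for all $\theta$ with $\min_e \theta(e) \ge \theta_{\min}$, we have that

$$\frac{\partial\,\mathrm{Maj}}{\partial \eta(v)}(\theta,0) \ge \frac{a(b^\ell)\theta_{\min}^{\ell}}{b^\ell}. \tag{37}$$

By (36) this will imply (35) for all $\eta$ satisfying $\max_v \eta(v) < \epsilon$, where

$$\epsilon = \frac{a(b^\ell)\theta_{\min}^{\ell} - \alpha}{C\,b^\ell}.$$

Fix $v \in \partial T$, and let $e_1, \ldots, e_\ell$ be the path in $T$ from the root $\rho$ to $v$. Let $\gamma \in [0, 1]$
and

$$\eta_0(w) = \begin{cases} 0 & \text{if } w \ne v, \\ \gamma & \text{if } w = v. \end{cases}$$

$\mathrm{Maj}(\theta,\eta_0)$ is the covariance of the majority of $b^\ell$ i.i.d. ±1 variables and $\sigma_\rho$. One
of these variables, $\sigma_v$, satisfies $E[\sigma_v \sigma_\rho] = \gamma \prod_{i=1}^{\ell} \theta(e_i)$, while all the other variables
are independent of $\sigma_\rho$ and $\sigma_v$. Therefore by part 1 of Lemma 4.4 it follows that

$$\mathrm{Maj}(\theta,\eta_0) = \gamma\,\frac{a(b^\ell)}{b^\ell} \prod_{i=1}^{\ell} \theta(e_i).$$

So

$$\frac{\partial\,\mathrm{Maj}}{\partial \eta(v)}(\theta,0) = \frac{a(b^\ell)}{b^\ell} \prod_{i=1}^{\ell} \theta(e_i).$$

The assumption on $\theta$ implies that $\prod_{i=1}^{\ell} \theta(e_i) \ge \theta_{\min}^{\ell}$. Therefore,

$$\frac{\partial\,\mathrm{Maj}}{\partial \eta(v)}(\theta,0) \ge \frac{a(b^\ell)\theta_{\min}^{\ell}}{b^\ell},$$

to obtain (37), as needed. ■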

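Lemma 4.6. Let $T$ be an $\ell$-level balanced $b$-ary tree. Suppose that $\min\theta \ge \theta_{\min}$
and $\max\eta \ge \eta_{\max}$. Then

$$\mathrm{Maj}(\theta,\eta) \ge \frac{a(b^\ell)}{2^\ell b^{2\ell+1}}\, h(\theta_{\min})^{\ell-1}\, h(\theta_{\min}\eta_{\max}), \tag{38}$$

where

$$h(x) = \min\Big\{1, \frac{x}{1-x}\Big\}. \tag{39}$$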
Proof. We use the "random cluster" representation of the model (see [14] for
background on percolation and random-cluster models). Declare an edge $e = (u, v)$ open
with probability $\theta(e)$ if $e$ is not adjacent to $\partial T$, and with probability $\theta(e)\eta(v)$ if
$v \in \partial T$, independently for all edges. An edge which is not open is declared closed.
Given the clusters (connected open components) of the random cluster representation,
color the root cluster by the root color, and each of the other clusters by an
independent unbiased ±1 variable. This gives the same distribution on colorings as
the original coloring procedure (this should be clear; see e.g. [23] for more details).
Assume that the root color is 1, and let $C_\rho$ be the root cluster. Let

$$X = \sum_{v \in C'_\rho} \sigma_v = |C'_\rho| = |C_\rho \cap \partial T|,$$

where $C'_\rho = C_\rho \cap \partial T$. Let $C_1, \ldots, C_K$ be all other clusters (note that $K$ is a random
variable) and let

$$Y_i = \sum_{v \in C'_i} \sigma_v,$$

where $C'_i = C_i \cap \partial T$. Conditioned on $C_\rho, C_1, \ldots, C_K$,

$$P[Y_i = \pm |C_i \cap \partial T|] = 1/2,$$

and the $Y_i$'s are independent conditioned on $C_\rho, C_1, \ldots, C_K$. Clearly,

$$\mathrm{Maj}(\theta,\eta) = E\Big[\mathrm{sign}\Big(X + \sum_{i=1}^{K} Y_i\Big)\Big]. \tag{40}$$

Note that conditioned on $C_\rho, C_1, \ldots, C_K$, the variable $\sum_{i=1}^{K} Y_i$ is symmetric and
therefore,

$$E\Big[\mathrm{sign}\Big(X + \sum_{i=1}^{K} Y_i\Big) \,\Big|\, |C'_\rho| < \max_i |C'_i|\Big] \ge 0. \tag{41}$$

Moreover, below we prove that

$$P\big[|C'_\rho| \ge \max_i |C'_i|\big] \ge 2^{-\ell} b^{-\ell-1} h(\theta_{\min})^{\ell-1} h(\theta_{\min}\eta_{\max}). \tag{42}$$

When $X > 0$, there are at most $b^\ell - 1$ non-zero variables among the $Y_i$'s. Therefore
part 2 of Lemma 4.4 implies that

$$E\Big[\mathrm{sign}\Big(X + \sum_{i=1}^{K} Y_i\Big) \,\Big|\, |C'_\rho| \ge \max_i |C'_i|\Big] \ge \frac{a(b^\ell)}{b^\ell}. \tag{43}$$

Combining (41) and (43) via (42) we obtain:

$$\mathrm{Maj}(\theta,\eta) = P\big[|C'_\rho| < \max_i |C'_i|\big]\, E\Big[\mathrm{sign}\Big(X + \sum_{i=1}^{K} Y_i\Big) \,\Big|\, |C'_\rho| < \max_i |C'_i|\Big] + P\big[|C'_\rho| \ge \max_i |C'_i|\big]\, E\Big[\mathrm{sign}\Big(X + \sum_{i=1}^{K} Y_i\Big) \,\Big|\, |C'_\rho| \ge \max_i |C'_i|\Big]$$

$$\ge \frac{a(b^\ell)}{2^\ell b^{2\ell+1}}\, h(\theta_{\min})^{\ell-1}\, h(\theta_{\min}\eta_{\max}),$$

as needed.
It remains to prove (42). Let $\Omega$ be the probability space of all random cluster
configurations. We prove (42) by constructing a map $G : \Omega \to \Omega$ such that for all
$\omega \in \Omega$,

$$|C'_\rho(G(\omega))| \ge \max_i |C'_i(G(\omega))|, \tag{44}$$

and for all $\omega$

$$P[G^{-1}(\omega)] \le b^{\ell+1} 2^{\ell}\, h(\theta_{\min})^{1-\ell}\, h(\theta_{\min}\eta_{\max})^{-1}\, P[\omega]. \tag{45}$$

If $\omega$ satisfies $|C'_\rho| \ge \max_i |C'_i|$, then we let $G(\omega) = \omega$. Otherwise, let $C$ be a
cluster such that $|C \cap \partial T| = \max_i |C'_i|$. If $\max_i |C'_i| = 1$, we let $C$ be a cluster which
contains a $v \in \partial T$ with $\eta(v) \ge \eta_{\max}$. Let $u \in C$ be the vertex closest to the root $\rho$.
Let $G(\omega)$ be the configuration which is obtained from $\omega$ by setting all the edges on
the path from $u$ to $\rho$ to be open.

It is clear that $G(\omega)$ satisfies (44). Let $\omega$ be such that $|C'_\rho(\omega)| \ge \max_i |C'_i(\omega)|$.
Then any element in $G^{-1}(\omega)$ is obtained by

• choosing a vertex $u \in T$ such that either $u \notin \partial T$, or $u \in \partial T$ and $\eta(u) \ge \eta_{\max}$,

• choosing a subset $S$ of the edges on the path from $u$ to $\rho$,

• setting all the edges of $S$ to be closed.

If $\omega'$ is the configuration thus obtained, then clearly,

$$P[\omega'] \le h(\theta_{\min})^{1-\ell}\, h(\theta_{\min}\eta_{\max})^{-1}\, P[\omega]. \tag{46}$$

It remains to count the number of $\omega'$ which may be obtained from $\omega$. There are
at most $b^{\ell+1}$ choices for $u$. Moreover, there are at most $2^\ell$ subsets of the edges we
want to update at the second stage. Thus there are at most $b^{\ell+1} 2^{\ell}$ pre-images $\omega'$
to consider, each satisfying (46). We thus obtain (45) as needed. ■
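The random cluster representation used in this proof translates directly into a sampling procedure, sketched below in Python. On a tree, declaring edges open or closed independently and giving each non-root cluster a fresh unbiased color is the same as propagating colors from the root: a child inherits its parent's color across an open edge and draws a fresh fair color across a closed one. The sketch assumes a complete $b$-ary tree with constant $\theta$ and $\eta$, which is a simplification not made in the text, and the names `sample_leaf_colors` and `estimate_maj` are hypothetical.

```python
import random


def sample_leaf_colors(b, ell, theta, eta, root_color=1, rng=random):
    """Sample the observed leaf colors of a complete b-ary tree with ell levels.

    Edges not adjacent to the leaves are open with probability theta; edges
    adjacent to the leaves are open with probability theta * eta.
    """
    level = [root_color]
    for depth in range(1, ell + 1):
        p_open = theta * eta if depth == ell else theta
        next_level = []
        for parent_color in level:
            for _ in range(b):
                if rng.random() < p_open:
                    next_level.append(parent_color)         # same cluster as parent
                else:
                    next_level.append(rng.choice((1, -1)))  # new cluster, fair color
        level = next_level
    return level


def estimate_maj(b, ell, theta, eta, trials, seed=0):
    """Monte Carlo estimate of E[sign(sum of leaf colors)] given root color +1."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        s = sum(sample_leaf_colors(b, ell, theta, eta, 1, rng))
        total += (s > 0) - (s < 0)
    return total / trials
```

Calling `estimate_maj(3, 2, 0.9, 0.8, 10000)` gives a rough empirical value that can be set against the lower bound (38); the estimate is random and its accuracy depends on the number of trials.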

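Proof of Theorem 4.1: By Lemma 4.5 there exists $\epsilon > 0$ such that if for all
$v \in \partial T$ it holds that $\eta(v) < \epsilon$, then

$$\mathrm{Maj}(\theta,\eta) \ge \alpha\,\eta_{\min}.$$

By Lemma 4.6 it follows that if $\max_{v \in \partial T} \eta(v) \ge \epsilon$, then

$$\mathrm{Maj}(\theta,\eta) \ge \beta,$$

where

$$\beta = \frac{a(b^\ell)}{2^\ell b^{2\ell+1}}\, h(\theta_{\min})^{\ell-1}\, h(\theta_{\min}\epsilon),$$

and $h$ is given by (39). Now the first claim follows. The second claim follows from
the first claim by Lemma 4.3. ■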

5. Four point condition and topology 

In this section we discuss how to reconstruct the $\ell$-topology of a balanced tree,
given the correlation between colors at different leaves. The analysis in this section
does not exhibit a phase transition when $b\theta_{\min}^2 = 1$, as $\theta_{\min}$ has a continuous role
in the bounds below. We follow the well known technique of the "4-point condition".
However, as we are required to reconstruct only the local topology, and consider only
balanced trees, the number of samples needed is logarithmic in $n$.

The following theorem generalizes the first part of Lemma 2.8.

Theorem 5.1. Let $\ell$ be a positive integer and consider the CFN$(\theta,\eta)$ model on
the family of balanced trees on $n$ leaves, where
I. For all edges $e$, $\theta_{\min} \le \theta(e) \le \theta_{\max}$, where $\theta_{\min} > 0$ and $\theta_{\max} < 1$, and
II. For all $v \in \partial T$, $\eta_{\min} \le \eta(v)$, where $\eta_{\min} > 0$.

Then there exists a map $\Phi$ from $\{\pm 1\}^{kn}$ to the space of $\ell$-topologies on $n$ leaves,
such that

$$P\big[\Phi\big((\sigma^t_{\partial T})_{t=1}^{k}\big) = \ell\text{-topology of } T\big] \ge 1 - \delta,$$

where $(\sigma^t_{\partial T})_{t=1}^{k}$ are $k$ independent samples of the process at the leaves of $T$, and

$$\delta \le n^2 \exp\big(-c^* k\, \theta_{\min}^{8\ell+8}\, \eta_{\min}^{8}\, (1 - \theta_{\max})^2\big), \tag{47}$$

with $c^* \ge 1/2048$. Moreover, $\Phi$ is computable in polynomial time in $n$ and $k$.

Definition 5.1. Consider the CFN$(\theta,\eta)$ coloring of a tree $T$. For any two leaves
$u, v$, let

$$\theta(u,v) = \eta(u)\eta(v) \prod_{w \in \mathrm{path}(u,v)} \theta(w),$$

and

$$D(u,v) = -\log\theta(u,v) = -\log(\eta(u)) - \log(\eta(v)) - \sum_{w \in \mathrm{path}(u,v)} \log(\theta(w)).$$

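Lemma 5.2. Suppose we are given $k$ samples $(\sigma^t_{\partial T})_{t=1}^{k}$ of the CFN$(\theta,\eta)$ coloring
of a tree $T$ as in Theorem 5.1. For $u, v \in \partial T$, let

$$c(u,v) = \frac{1}{k} \sum_{t=1}^{k} \sigma^t_u \sigma^t_v$$

(note that the expected value of $c(u,v)$ is $\theta(u,v)$), and

$$D^*(u,v) = \begin{cases} -\log c(u,v) & \text{if } c(u,v) > 0, \\ \infty & \text{if } c(u,v) \le 0. \end{cases}$$

Let $\theta_* = \theta_{\min}^{2\ell+2}\eta_{\min}^2/2$. Define $uRv$ if $c(u,v) \ge \theta_*$ and $uR'v$ if $c(u,v) \ge \frac{15}{8}\theta_*$ ($uRv$
roughly means that $u$ and $v$ are close; $uR'v$ means that $u$ and $v$ are even closer),
and let $1/4 > \epsilon > 0$. Then with probability at least

$$1 - n^2 \exp(-k\theta_*^4\epsilon^2/8), \tag{48}$$

I. $uR'v$ for all $u$ and $v$ such that $d(u,v) \le 2\ell + 2$.

II. For all $u$ and $v$, if there exists a $w$ such that $uRw$ and $vRw$, then
$|D(u,v) - D^*(u,v)| < \epsilon$.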
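The estimators of Lemma 5.2 are simple to compute from data. The Python sketch below evaluates the empirical correlation $c(u,v)$, the estimated distance $D^*(u,v)$ and a threshold relation of the form $c(u,v) \ge$ threshold. The data layout (each sample is a sequence of ±1 colors indexed by leaf) and the function names are assumptions made for illustration.

```python
import math


def empirical_correlation(samples, u, v):
    """c(u, v) = (1/k) * sum over samples t of sigma^t_u * sigma^t_v."""
    return sum(s[u] * s[v] for s in samples) / len(samples)


def estimated_distance(samples, u, v):
    """D*(u, v) = -log c(u, v) if c(u, v) > 0, and infinity otherwise."""
    c = empirical_correlation(samples, u, v)
    return -math.log(c) if c > 0 else math.inf


def related(samples, u, v, threshold):
    """True when c(u, v) >= threshold (e.g. threshold = theta_* for the relation R)."""
    return empirical_correlation(samples, u, v) >= threshold
```

For example, with four samples `[[1, 1, -1], [1, 1, 1], [-1, -1, 1], [1, -1, -1]]` the leaves 0 and 1 agree in three samples out of four, so $c(0,1) = 1/2$ and $D^*(0,1) = \log 2$, while $c(0,2) = -1/2$ gives $D^*(0,2) = \infty$.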

Proof. Define

$$A = \{\exists (u,v) \text{ s.t. } |c(u,v) - \theta(u,v)| > a\},$$

where $a = \epsilon\theta_*^2/2$. We claim that conditioned on $A^c$, both I. and II. hold. If
$u, v \in \partial T$ satisfy $d(u,v) \le 2\ell + 2$, then $\theta(u,v) \ge 2\theta_*$. Therefore conditioned on $A^c$,
all $u, v \in \partial T$ s.t. $d(u,v) \le 2\ell + 2$ must satisfy

$$c(u,v) \ge 2\theta_* - \epsilon\theta_*^2/2 > \frac{15}{8}\theta_*,$$

and I. follows.

Conditioned on $A^c$, if $uRw$, then $c(u,w) \ge \theta_*$ and therefore $\theta(u,w) \ge \theta_* - a$.
Similarly, if $vRw$, then $\theta(v,w) \ge \theta_* - a$. Now

$$\theta(u,v) = \eta(u)\eta(v) \prod_{y \in \mathrm{path}(u,v)} \theta(y) \ge \Big(\eta(u)\eta(w) \prod_{y \in \mathrm{path}(u,w)} \theta(y)\Big)\Big(\eta(w)\eta(v) \prod_{y \in \mathrm{path}(w,v)} \theta(y)\Big) = \theta(u,w)\theta(w,v) \ge (\theta_* - a)^2,$$

and therefore conditioned on $A^c$,

$$c(u,v) \ge (\theta_* - a)^2 - a.$$

Therefore conditioned on $A^c$, by the mean value theorem,

$$|D^*(u,v) - D(u,v)| = |\log c(u,v) - \log\theta(u,v)| \le \frac{a}{(\theta_* - a)^2 - a} = \frac{\epsilon\theta_*^2/2}{(\theta_* - \epsilon\theta_*^2/2)^2 - \epsilon\theta_*^2/2} = \frac{\epsilon}{2(1 - \epsilon\theta_*/2)^2 - \epsilon} < \epsilon,$$

to obtain II.

By Lemma 2.2, $P[A]$ is bounded by

$$P[A] \le \binom{n}{2}\, 2\exp(-k\theta_*^4\epsilon^2/8) \le n^2 \exp(-k\theta_*^4\epsilon^2/8), \tag{49}$$

as needed.

■

For a set $V$ of size 4 a split is defined as a partition of $V$ into two sets of size 2.
We will write $u_1u_2|u_3u_4$ for the split $\{\{u_1, u_2\}, \{u_3, u_4\}\}$. Note that a 4-element set
has exactly 3 different splits.

Lemma 5.3. Let $T = (V, E)$ be a balanced tree. Let $\Delta : E \to \mathbb{R}_+$ be a positive
function. For $u, v \in V$, let $\Delta(u,v) = \sum_{e \in \mathrm{path}(u,v)} \Delta(e)$. For a split $\Gamma = u_1u_2|u_3u_4$,
let

$$\Delta(\Gamma) = \Delta(u_1,u_2) + \Delta(u_3,u_4).$$

Then

• If $\Gamma_1$ and $\Gamma_2$ are two splits of $\{u_1, u_2, u_3, u_4\}$, then either $\Delta(\Gamma_1) = \Delta(\Gamma_2)$,
or $|\Delta(\Gamma_1) - \Delta(\Gamma_2)| \ge 2\Delta_{\min}$, where

$$\Delta_{\min} = \min\{\Delta(e) : e \text{ not adjacent to } \partial T\}.$$

• Let $R$ be a binary relation on $\partial T$ such that $uRv$ whenever $d(u,v) \le 2\ell + 2$.
Write $R(v)$ for the set of elements which are related to $v$. Then in order to
reconstruct the $\ell$-topology of the tree it suffices to find for all $u \in \partial T$ and
all $\{u_1, u_2, u_3, u_4\} \subset R(u)$, all minimizers of

$$\{\Delta(\Gamma) : \Gamma \text{ a split of } \{u_1, u_2, u_3, u_4\}\}$$

(we call such minimizers minimal splits).

Proof. Let $U$ be a set of four vertices. Note that either there is a unique split
$u_1u_2|u_3u_4$ of $U$ such that $\mathrm{path}(u_1,u_2) \cap \mathrm{path}(u_3,u_4)$ is empty, or for all splits
$u_1u_2|u_3u_4$, the set $\mathrm{path}(u_1,u_2) \cap \mathrm{path}(u_3,u_4)$ consists of a single vertex.

Suppose that $u_i \in \partial T$ for $1 \le i \le 4$ and that $\mathrm{path}(u_1,u_2) \cap \mathrm{path}(u_3,u_4)$
is empty. Let $u_{1,2}$ be the point on $\mathrm{path}(u_1,u_2)$ which is closest to $\mathrm{path}(u_3,u_4)$.
Define $u_{3,4}$ similarly. Then

$$\Delta(u_1,u_3) + \Delta(u_2,u_4) = \Delta(u_1,u_4) + \Delta(u_2,u_3), \tag{50}$$

and

$$\Delta(u_1,u_3) + \Delta(u_2,u_4) - \Delta(u_1,u_2) - \Delta(u_3,u_4) = 2 \sum_{e \in \mathrm{path}(u_{1,2},u_{3,4})} \Delta(e) \ge 2\Delta_{\min}. \tag{51}$$

If on the other hand, $\mathrm{path}(u_1,u_2) \cap \mathrm{path}(u_3,u_4)$ consists of a single point, then for
all permutations $i, j, k, l$ of $1, 2, 3, 4$,

$$\Delta(u_i,u_j) + \Delta(u_k,u_l) \text{ has the same value.} \tag{52}$$

The first claim follows.
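The comparison of the three splits of a quartet can be written out as a short Python sketch. Given any distance function on four leaves, it returns the splits of minimal weight; the tolerance parameter `tol` and the function name `minimal_splits` are assumptions of the sketch (with exact distances `tol = 0` suffices, while with estimated distances a positive tolerance plays the role of the slack used later in the section).

```python
def minimal_splits(dist, quartet, tol=0.0):
    """Return the splits of a 4-element set whose weight is within tol of the minimum.

    dist(u, v) is a symmetric distance; the weight of the split u1u2|u3u4 is
    dist(u1, u2) + dist(u3, u4).
    """
    u1, u2, u3, u4 = quartet
    splits = [((u1, u2), (u3, u4)),
              ((u1, u3), (u2, u4)),
              ((u1, u4), (u2, u3))]
    weights = [dist(a, b) + dist(c, d) for (a, b), (c, d) in splits]
    best = min(weights)
    return [s for s, w in zip(splits, weights) if w <= best + tol]
```

On a tree with leaves a, b attached to one internal vertex, leaves c, d attached to another, and all edges of length 1, the split ab|cd has weight 4 while the other two have weight 6, a gap of twice the internal edge length as in (51). On a star all three splits tie, as in (52).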

Let $\rho$ be the root of the tree and let $q$ be the distance between $\rho$ and the
leaves (since the tree is balanced, the distance to all the leaves is the same). If
$q \le \ell + 1$, then all $u, v \in \partial T$ are $R$ related. In this case, it is well known
that the topology of the tree may be recovered from all minimal splits (this is the
classical "4 point method", see e.g. |H1). We assume below that $q > \ell + 1$. Let
$B_r(u) = \{v : d(v,u) \le 2r\}$. Note that $B_r(u) \subset R(u)$, for all $u \in \partial T$ and $r \le \ell + 1$.

Claim 5.4. $d$ satisfies

• $d(u,v) = 0$ if and only if $u = v$.

• For $1 \le r \le \ell$, $d(u,v) = 2r$ if and only if $v \in R(u) \setminus B_{r-1}(u)$, and for all
$\{w, w'\} \subset R(u) \setminus (B_{r-1}(u) \cup B_{r-1}(v))$, the split $uv|ww'$ is a minimal split.

Proof. The first part is trivial.

For the second part note that if $d(u,v) = 2r$, then $v \in R(u)$. Moreover, for all
$w, w' \notin B_{r-1}(u) \cup B_{r-1}(v)$, the intersection $\mathrm{path}(u,v) \cap \mathrm{path}(w,w')$ is either
empty or consists of a single vertex. Therefore $uv|ww'$ is a minimal split.

If $d(u,v) < 2r$, then $v \notin R(u) \setminus B_{r-1}(u)$.

Suppose that $d(u,v) > 2r$ and $v \in R(u)$. Since the tree is balanced, all the
internal degrees are at least 3 and $r + 1 \le \ell + 1 < q$, it follows that the sets $B_r(u) \setminus
B_{r-1}(u)$ and $B_{r+1}(u) \setminus (B_r(u) \cup B_{r-1}(v))$ are not empty. Let $u' \in B_r(u) \setminus B_{r-1}(u)$
and $v' \in B_{r+1}(u) \setminus (B_r(u) \cup B_{r-1}(v))$. Then $u', v' \in R(u) \setminus (B_{r-1}(u) \cup B_{r-1}(v))$ and
$\mathrm{path}(u,u') \cap \mathrm{path}(v,v')$ is empty; therefore $uv|u'v'$ is not a minimal split. ■

By Claim 5.4, from the minimal splits, we can recursively reconstruct for $r =
0, \ldots, \ell$ all pairs $u, v \in \partial T$ such that $d(u,v) = 2r$. The second claim follows. ■
Proof of Theorem 5.1: Note that by letting

$$D(e) = \begin{cases} -\log(\theta(e)) & \text{if } e \text{ is not adjacent to } \partial T, \\ -\log(\theta(e)\eta(v)) & \text{if } e = (u,v) \text{ and } v \in \partial T, \end{cases}$$

the metric $D$ of Definition 5.1 is of the form of the metric in Lemma 5.3.
Moreover

$$D_{\min} = \min\{D(e) : e \text{ not adjacent to } \partial T\} \ge \min_e -\log\theta(e) \ge -\log\theta_{\max} \ge 1 - \theta_{\max}.$$

Let $\epsilon' = -\log\theta_{\max}/4$ and $\epsilon = (1 - \theta_{\max})/4$.

We condition on the event that I. and II. of Lemma 5.2 hold with $\epsilon$; so $2D_{\min} \ge
8\epsilon' = 8\epsilon + 8(\epsilon' - \epsilon)$. Thus for all $u$ and $v$ such that $d(u,v) \le 2\ell + 2$ it holds that
$uR'v$.

Note that there exists a symmetric relation $\tilde R$ such that $R' \subset \tilde R \subset R$ and such
that for all $u$ and $v$ it is decidable in time polynomial in $k$ if they are $\tilde R$ related or
not (to compute $\tilde R$ it suffices to check if $c(u,v) \ge \frac{3}{2}\theta_*$, within accuracy $\theta_*/4$).

For a split $\Gamma = u_1u_2|u_3u_4$, write $D^*(\Gamma)$ for $D^*(u_1,u_2) + D^*(u_3,u_4)$.

Fix $u \in \partial T$ and $U = \{u_1, u_2, u_3, u_4\} \subset \tilde R(u)$. For all splits $\Gamma$ of $U$, $|D^*(\Gamma) -
D(\Gamma)| < 2\epsilon$.

Let $\Gamma_1, \Gamma_2$ be splits of $U$. We claim that $D(\Gamma_1) \le D(\Gamma_2)$ if and only if $D^*(\Gamma_1) \le
D^*(\Gamma_2) + 4\epsilon$ if and only if $D^*(\Gamma_1) \le D^*(\Gamma_2) + 4\epsilon'$. Indeed, if $D^*(\Gamma_1) \le D^*(\Gamma_2) + 4\epsilon'$,
then $D(\Gamma_1) < D(\Gamma_2) + 4\epsilon' + 4\epsilon \le D(\Gamma_2) + 2D_{\min}$, and therefore $D(\Gamma_1) \le D(\Gamma_2)$,
by the first part of Lemma 5.3. If on the other hand, $D^*(\Gamma_1) > D^*(\Gamma_2) + 4\epsilon$, then
$D(\Gamma_1) > D(\Gamma_2)$, as needed.

Moreover, given that either $D^*(\Gamma_1) > D^*(\Gamma_2) + 4\epsilon'$ or $D^*(\Gamma_1) \le D^*(\Gamma_2) + 4\epsilon$,
we may find which of the two holds in time polynomial in $k$. Therefore, the minimal
splits may be recovered in time polynomial in $n$ and $k$.

We therefore conclude that conditioned on I. and II. of Lemma 5.2 we may
recover all the minimal splits of $U$, for all $U \subset \tilde R(u)$ and all $u \in \partial T$, in time
polynomial in $k$ and $n$.

It now follows from the second part of Lemma 5.3 and from Lemma 5.2 that we
may recover the $\ell$-topology of the tree with error probability bounded by (48):

$$n^2 \exp\Big(-\frac{k}{8}\theta_*^4\epsilon^2\Big) = n^2 \exp\Big(-\frac{k}{2048}\,\theta_{\min}^{8\ell+8}\,\eta_{\min}^{8}\,(1 - \theta_{\max})^2\Big),$$

as needed.

Finally note that given the relation $\tilde R$ and all the minimal splits, the reconstruction
procedure described in Lemma 5.3 is computable in time polynomial in $n$. We
conclude that the function $\Phi$ is computable in time polynomial in $n$ and $k$.



6. Reconstruction of balanced trees 

The proof of Theorem 1.5 is similar to that of Theorem 1.3. The main difference
is that instead of just calculating correlations in order to recover the $\ell$-topology, the
4-point method, i.e., Theorem 5.1, is applied. The analysis of the majority function
in the more general setting, i.e., Theorem 4.1, is also needed.

Proof of Theorem 1.5: Let $b$ and $\theta_{\min}$ be such that $b\theta_{\min}^2 > 1$. By Theorem
4.1 there exist $\ell$, $\alpha > 1$ and $\beta > 0$ such that (22) holds.

To recover $d$, we will apply Theorem 5.1 and Lemma 4.2 recursively in order to
recover $d^*_{i\ell}$, for $i = 0, \ldots, \lceil q/\ell \rceil$, where $q$ is the distance from the root of the tree to
the leaves.

We note that the algorithms in Theorem 5.1 and in Lemma 4.2 are polynomial
time algorithms in $k$ and $n$; since $k$ is polynomial in $n$, it follows that the running
time of the reconstruction algorithm below is polynomial in $n$.

Trivially, $d^*_0(v,u) = 2 \cdot 1_{v \ne u}$. We show how given $d^*_{i\ell}$ and the samples $(\sigma^t_{\partial T})_{t=1}^{k}$, we
can recover $d^*_{(i+1)\ell}$ with error probability bounded by $\exp(-ck)\, n^2 b^{-2i\ell}$, where

$$c = c^*\, \theta_{\min}^{8\ell+8}\, \beta^{8}\, (1 - \theta_{\max})^2 \tag{53}$$

and $c^* \ge 1/2048$.

Let $\Psi : \{\pm 1\}^{n} \to \{\pm 1\}^{L_{q-i\ell}}$ be the function defined in Lemma 4.2 given $d^*_{i\ell}$.
Then

$$\Psi(\sigma^t_{\partial T}) = (\sigma^t_v \tau^t_v)_{v \in L_{q-i\ell}},$$

where the $\tau^t_v$ are independent variables with $E[\tau^t_v] \ge \beta$. Moreover, the $\tau^t_v$ are independent
of $(\sigma^t_v : d(v,\partial T) \ge i\ell,\ 1 \le t \le k)$.

By Theorem 5.1, given $(\sigma^t_v \tau^t_v : v \in L_{q-i\ell})_{t=1}^{k}$, we may recover

$$d' : L_{q-i\ell} \times L_{q-i\ell} \to \{0, \ldots, 2\ell + 2\},$$

defined by $d'(u,v) = \min\{d(u,v), 2\ell + 2\}$, with error probability bounded by $\exp(-ck)\, n^2 b^{-2i\ell}$.
As in Theorem 1.3, it is easy to write $d^*_{(i+1)\ell}$ in terms of $d^*_{i\ell}$ and $d'$.

Letting $A_i$ be the event of error in recovering $d^*_{(i+1)\ell}$ given $d^*_{i\ell}$, and $a = \sum_{i \ge 0} P[A_i]$,
the total error probability is at most $a$.
Now

$$a \le \exp(-ck)\big(n^2 + n^2/b^{2\ell} + n^2/b^{4\ell} + \cdots\big) \le 2n^2 \exp(-ck).$$

Defining $c' = c^{-1}$, and taking

$$k = \frac{2\log n + \log 2 - \log\delta}{c} = c'(2\log n + \log 2 - \log\delta), \tag{54}$$

we obtain $a \le \delta$. The statement of the theorem follows from (54) and (53). ■
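The sample size in (54) is easy to evaluate. The Python sketch below rounds (54) up to an integer and evaluates the union bound $2n^2\exp(-ck)$; the value of $c$ is treated as a given input (in the proof it comes from (53)), and the function names are hypothetical.

```python
import math


def samples_needed(n, delta, c):
    """Smallest integer k with k >= (2 log n + log 2 - log delta) / c, as in (54)."""
    return math.ceil((2 * math.log(n) + math.log(2) - math.log(delta)) / c)


def union_bound(n, k, c):
    """The bound 2 n^2 exp(-c k) on the total error probability."""
    return 2 * n ** 2 * math.exp(-c * k)
```

Since $k$ enters only through $\exp(-ck)$, squaring the number of leaves $n$ adds $2\log n / c$ to the required $k$, which is the logarithmic dependence on $n$ stated in the theorem.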
Proof of Theorem 1.6: The proof is similar to that of Theorem 1.5. Let
$h^2 = (g^2 + b\theta_{\min}^2)/2$, so that $b\theta_{\min}^2 > h^2$. We choose $\ell$, $\alpha > h^{\ell}$ and $\beta > 0$ such that
(22) holds. The main difference from the proof of Theorem 1.5 is that when we
recover $(\sigma^t_v \tau^t_v)_{v \in L_{q-i\ell},\ 1 \le t \le k}$, the $\tau^t_v$ are independent variables satisfying a weaker
inequality, $E[\tau^t_v] \ge \beta h^{i\ell}$.
Therefore 

where c is given in (|53|) . and q = log^ n. 
li k = C5"^«, then 



5]P[A,] <2n2exp(-5c(V3)*'), 



which is smaller than S for all n, for sufficiently large value of c. ■ 